API Monitoring and Observability Best Practices

5 min read
Julia, Frontend Developer

Introduction to API Monitoring and Observability

API monitoring tracks the availability, performance, and correctness of APIs, while observability provides deeper insights into system behavior through metrics, logs, and traces. Together, they form the foundation for maintaining reliable API services.

Key differences:

  • Monitoring: Tracks known failure modes and predefined metrics
  • Observability: Enables investigation of unknown issues through rich telemetry

Core Metrics to Monitor

1. Availability Metrics

# Example availability check with Python
import requests
from datetime import datetime

def check_api_availability(url):
    try:
        start = datetime.now()
        response = requests.get(url, timeout=5)
        end = datetime.now()
        
        return {
            'timestamp': start.isoformat(),
            'status_code': response.status_code,
            'response_time_ms': (end - start).total_seconds() * 1000,
            'available': response.status_code == 200
        }
    except Exception as e:
        return {
            'timestamp': datetime.now().isoformat(),
            'error': str(e),
            'available': False
        }

2. Performance Metrics

  • Response time (p50, p90, p99); see the percentile sketch after this list
  • Throughput (requests/second)
  • Error rates (4xx, 5xx)
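
As a rough illustration of how these percentiles can be derived from raw measurements, the sketch below computes p50/p90/p99 from a list of response times using only the standard library; the sample values are made up for demonstration.

# Percentile sketch with Python's standard library (sample values are illustrative)
import statistics

def latency_percentiles(samples_ms):
    # quantiles(n=100) returns the 99 cut points between the 1st and 99th percentiles
    cuts = statistics.quantiles(samples_ms, n=100)
    return {
        'p50': cuts[49],
        'p90': cuts[89],
        'p99': cuts[98],
    }

response_times_ms = [120, 95, 310, 87, 150, 220, 98, 1040, 133, 101]
print(latency_percentiles(response_times_ms))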

3. Business Metrics

  • API call volume per customer
  • Feature usage patterns
  • Rate limit utilization (see the sketch below)
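
A hypothetical sketch of how per-customer call volume and rate-limit utilization might be aggregated from request logs; the customer IDs, field names, and limits are assumptions for illustration only.

# Hypothetical sketch: per-customer call volume and rate-limit utilization
# (customer IDs, field names, and limits below are assumed for illustration)
from collections import Counter

RATE_LIMITS = {'customer-a': 10000, 'customer-b': 1000}  # allowed requests per hour (assumed)

def business_metrics(request_logs):
    volume = Counter(record['customer_id'] for record in request_logs)
    return {
        customer: {
            'calls': calls,
            'rate_limit_utilization': calls / RATE_LIMITS.get(customer, float('inf')),
        }
        for customer, calls in volume.items()
    }

logs = [{'customer_id': 'customer-a'}, {'customer_id': 'customer-a'}, {'customer_id': 'customer-b'}]
print(business_metrics(logs))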

Implementing Distributed Tracing

// Node.js example with OpenTelemetry
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');

// Export finished spans to a Jaeger collector
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new JaegerExporter({
      endpoint: 'http://jaeger-collector:14268/api/traces'
    })
  )
);

// Register this provider as the global tracer provider
provider.register();

// Auto-instrument inbound and outbound HTTP calls
registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [new HttpInstrumentation()]
});

Logging Best Practices

Structured Logging Example

// Go example with structured logging
package main

import (
	"log/slog"
	"net/http"
	"os"
	"time"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	
	http.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		
		// API logic here
		
		logger.Info("API request",
			"method", r.Method,
			"path", r.URL.Path,
			"duration_ms", time.Since(start).Milliseconds(),
			"status", http.StatusOK,
			"user_agent", r.UserAgent(),
		)
	})
	
	if err := http.ListenAndServe(":8080", nil); err != nil {
		logger.Error("server stopped", "error", err)
	}
}

Alerting Strategies

Multi-Level Alerting

  1. Warning: Degraded performance (p95 > 500ms)
  2. Critical: API unavailable (5 consecutive failures)
  3. Business: Abnormal traffic patterns (>3σ from baseline; see the sketch below)
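
For the business-level alert, here is a minimal sketch of a 3σ baseline check on request volume; the baseline window and sample values are assumptions for illustration.

# Minimal sketch: flag traffic more than 3 standard deviations from a baseline
# (baseline values below are illustrative)
import statistics

def is_traffic_anomalous(current_rps, baseline_rps, sigmas=3):
    mean = statistics.mean(baseline_rps)
    stdev = statistics.stdev(baseline_rps)
    return abs(current_rps - mean) > sigmas * stdev

baseline = [1200, 1150, 1300, 1250, 1180, 1220, 1275]  # requests/second on previous days
print(is_traffic_anomalous(2400, baseline))  # True: far outside the baseline band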

Prometheus Alert Rule Example

# prometheus.rules.yml
groups:
- name: api-alerts
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.05
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High error rate on {{ $labels.service }}"
      description: "{{ $value | humanizePercentage }} of requests are failing"

Synthetic Monitoring

# Synthetic test with Locust
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    wait_time = between(1, 5)
    
    @task
    def get_resource(self):
        with self.client.get("/api/resource/123", catch_response=True) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status: {response.status_code}")
            elif response.json().get("data") is None:
                response.failure("Missing data field")
            else:
                response.success()

Implementing SLOs and Error Budgets

Calculating Error Budgets

Error Budget = (100% - SLO) * Time Period

Example for 99.9% monthly SLO:
Error Budget = 0.1% * 30 days = 43.2 minutes
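
The same arithmetic expressed as a small sketch, assuming a 30-day window:

# Sketch: translate an SLO into an error budget for a given window
def error_budget_minutes(slo_percent, window_days=30):
    window_minutes = window_days * 24 * 60
    return (100.0 - slo_percent) / 100.0 * window_minutes

print(error_budget_minutes(99.9))   # 43.2 minutes per 30-day month
print(error_budget_minutes(99.99))  # ~4.3 minutes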

SLO Tracking Dashboard Query

-- BigQuery example for SLO tracking
SELECT
  DATE(timestamp) as day,
  COUNT(*) as total_requests,
  SUM(CASE WHEN status_code BETWEEN 200 AND 299 THEN 1 ELSE 0 END) as success_count,
  SUM(CASE WHEN status_code BETWEEN 200 AND 299 THEN 0 ELSE 1 END) as error_count,
  (SUM(CASE WHEN status_code BETWEEN 200 AND 299 THEN 1 ELSE 0 END) / COUNT(*)) as availability
FROM api_requests
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY day

Correlation IDs for Debugging

// Spring Boot correlation ID example
@Configuration
public class CorrelationConfig implements WebMvcConfigurer {
    
    @Bean
    public Filter correlationIdFilter() {
        return new OncePerRequestFilter() {
            @Override
            protected void doFilterInternal(HttpServletRequest request,
                                            HttpServletResponse response,
                                            FilterChain filterChain)
                    throws ServletException, IOException {
                String correlationId = request.getHeader("X-Correlation-ID");
                if (correlationId == null) {
                    correlationId = UUID.randomUUID().toString();
                }
                
                MDC.put("correlationId", correlationId);
                response.setHeader("X-Correlation-ID", correlationId);
                
                try {
                    filterChain.doFilter(request, response);
                } finally {
                    MDC.remove("correlationId");
                }
            }
        };
    }
}

API Change Impact Monitoring

Breaking Change Detection

// Contract test example with Pact
const { Pact } = require('@pact-foundation/pact');
const { like } = require('@pact-foundation/pact').Matchers;

describe("API Contract", () => {
  const provider = new Pact({
    consumer: "WebApp",
    provider: "UserService",
  });

  before(() => provider.setup());
  after(() => provider.finalize());

  describe("GET /user/{id}", () => {
    before(() => {
      return provider.addInteraction({
        state: 'user exists',
        uponReceiving: 'a request for user data',
        withRequest: {
          method: 'GET',
          path: '/user/123'
        },
        willRespondWith: {
          status: 200,
          headers: { 'Content-Type': 'application/json' },
          body: {
            id: like(123),
            name: like('John Doe'),
            email: like('john@example.com')
          }
        }
      });
    });

    it("should verify the contract", () => {
      // Test implementation
    });
  });
});

Real User Monitoring (RUM)

<!-- Browser RUM implementation -->
<script>
window.API_MONITORING = {
  track: function(apiCall) {
    const start = performance.now();
    const options = apiCall.options || {};

    return fetch(apiCall.url, options)
      .then(response => {
        const duration = performance.now() - start;

        // sendBeacon delivers the report even if the page is unloading
        navigator.sendBeacon('/monitoring', JSON.stringify({
          api: apiCall.url,
          method: options.method || 'GET',
          status: response.status,
          duration: duration,
          timestamp: new Date().toISOString()
        }));

        return response;
      });
  }
};

// Usage:
API_MONITORING.track({
  url: '/api/data',
  options: { method: 'POST' }
});
</script>

Anomaly Detection

# Python anomaly detection with Prophet
from prophet import Prophet
import pandas as pd

def detect_anomalies(metrics_data):
    df = pd.DataFrame(metrics_data)
    df.columns = ['ds', 'y']
    
    # Use the 99% uncertainty interval as the expected band
    model = Prophet(interval_width=0.99)
    model.fit(df)
    
    # periods=0: predict over the historical range only, no future points
    future = model.make_future_dataframe(periods=0)
    forecast = model.predict(future)
    
    merged = df.merge(forecast, on='ds')
    anomalies = merged[(merged['y'] > merged['yhat_upper']) | 
                      (merged['y'] < merged['yhat_lower'])]
    
    return anomalies[['ds', 'y', 'yhat_lower', 'yhat_upper']]

Continuous Improvement Process

  1. Collect: Gather comprehensive telemetry
  2. Analyze: Identify patterns and root causes
  3. Prioritize: Focus on high-impact issues
  4. Remediate: Implement fixes and optimizations
  5. Verify: Confirm improvements through monitoring

Consider integrating these practices into your CI/CD pipeline so that issues are caught before a release reaches production. Start with basic availability monitoring and gradually add more sophisticated observability features as your needs evolve.
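
One way to wire this into a pipeline is a simple smoke check run against a staging environment before promotion; the sketch below is a minimal version, where the URL and endpoint list are placeholders.

# Hypothetical pre-deployment smoke check for a CI pipeline
# (STAGING_URL and ENDPOINTS are placeholders)
import sys
import requests

STAGING_URL = "https://staging.example.com"
ENDPOINTS = ["/health", "/api/resource/123"]

def smoke_check():
    failures = []
    for path in ENDPOINTS:
        try:
            response = requests.get(STAGING_URL + path, timeout=5)
            if response.status_code != 200:
                failures.append(f"{path}: HTTP {response.status_code}")
        except requests.RequestException as exc:
            failures.append(f"{path}: {exc}")
    return failures

if __name__ == "__main__":
    problems = smoke_check()
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)  # non-zero exit fails the CI job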
