Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidc6sll5zckhzof5rbiixt7g4e34kzybac46e4p6ke2ol2lzj4xua",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpe4pvunej42"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreiatrd6go55ixdscqxyuuelkk64y7c7oj7flqbr5y6i3wwzi4feqsa"
    },
    "mimeType": "image/webp",
    "size": 254144
  },
  "path": "/samson_tanimawo/the-golden-signals-a-practical-implementation-guide-4n8m",
  "publishedAt": "2026-06-28T13:18:06.000Z",
  "site": "https://dev.to",
  "tags": [
    "sre",
    "monitoring",
    "observability",
    "devops",
    "Nova AI Ops",
    "https://novaaiops.com",
    "@app.middleware"
  ],
  "textContent": "##  Four Metrics to Rule Them All\n\nGoogle's SRE book introduced the four golden signals: Latency, Traffic, Errors, and Saturation. Simple concept, but I've seen teams struggle with implementation.\n\nHere's a practical guide from someone who's implemented them across 50+ services.\n\n##  Signal 1: Latency\n\nNot all latency is equal. You need to track successful requests and error requests separately.\n\n\n\n    # Bad: Average latency\n    latency = total_request_time / total_requests  # Useless\n\n    # Good: Percentile latency, separated by status\n    from prometheus_client import Histogram\n\n    REQUEST_LATENCY = Histogram(\n        'http_request_duration_seconds',\n        'Request latency',\n        ['method', 'endpoint', 'status_class'],\n        buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]\n    )\n\n    @app.middleware\n    async def track_latency(request, call_next):\n        start = time.time()\n        response = await call_next(request)\n        duration = time.time() - start\n        status_class = f\"{response.status_code // 100}xx\"\n        REQUEST_LATENCY.labels(\n            method=request.method,\n            endpoint=request.url.path,\n            status_class=status_class\n        ).observe(duration)\n        return response\n\n\nAlert on p99, not p50. 
Your happiest users don't need help.\n\n\n\n    - alert: HighLatencyP99\n      expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5\n      for: 5m\n      labels:\n        severity: warning\n\n\n## Signal 2: Traffic\n\nTraffic answers the question \"is this normal?\" It's the context for every other signal.\n\n\n\n    # Current request rate\n    rate(http_requests_total[5m])\n\n    # Compare to same time last week\n    rate(http_requests_total[5m])\n      /\n    rate(http_requests_total[5m] offset 7d)\n\n    # Alert on sudden drops (possible outage nobody noticed)\n    - alert: TrafficDrop\n      expr: >\n        rate(http_requests_total[5m])\n        <\n        (rate(http_requests_total[5m] offset 1h) * 0.5)\n      for: 10m\n      annotations:\n        summary: \"Traffic dropped >50% compared to 1 hour ago\"\n\n\nTraffic drops are often more concerning than traffic spikes. A spike usually means users are showing up; a drop often means they can't reach you at all.\n\n## Signal 3: Errors\n\nTrack error rate as a percentage, not an absolute count, so the number stays comparable as traffic changes:\n\n\n\n    # Error rate percentage\n    (\n      sum(rate(http_requests_total{status=~\"5..\"}[5m]))\n      /\n      sum(rate(http_requests_total[5m]))\n    ) * 100\n\n\nBut also track error types separately:\n\n\n\n    error_categories:\n      - 5xx: \"Server errors (our fault)\"\n      - 4xx_excluding_404: \"Client errors (possible API issue)\"\n      - timeout: \"Request timeouts\"\n      - circuit_breaker: \"Dependency failures\"\n\n\n## Signal 4: Saturation\n\nThe most underrated signal. Saturation answers: \"how close are we to full?\"\n\n\n\n    # CPU saturation: cores used / cores allowed (cAdvisor metrics;\n    # a counter must be wrapped in rate(), and the quota is per period)\n    rate(container_cpu_usage_seconds_total[5m])\n      /\n    (container_spec_cpu_quota / container_spec_cpu_period)\n\n    # Memory saturation\n    container_memory_working_set_bytes / container_spec_memory_limit_bytes\n\n    # Connection pool saturation (metric names depend on your exporter)\n    active_connections / max_connections\n\n    # Queue saturation (the one everyone forgets)\n    message_queue_depth / message_queue_capacity\n\n\nAlert before you hit 100%. 
I use 80% as the threshold for warning and 95% for critical.\n\n## Putting It All Together\n\nEvery service gets a standard dashboard with four rows:\n\n\n\n    Row 1: Latency    [p50] [p90] [p99] [error latency]\n    Row 2: Traffic    [rate] [vs last week] [by endpoint]\n    Row 3: Errors     [rate %] [by type] [by endpoint]\n    Row 4: Saturation [CPU] [Memory] [Connections] [Queue]\n\n\nThis fits on one screen. No scrolling. Any engineer can assess service health in 10 seconds.\n\n## The Anti-Pattern\n\nDon't build a golden signals dashboard per service manually. Template it:\n\n\n\n    {\n      \"dashboard\": {\n        \"title\": \"Golden Signals: {{ service_name }}\",\n        \"templating\": {\n          \"list\": [\n            { \"name\": \"service\", \"type\": \"query\" },\n            { \"name\": \"environment\", \"type\": \"custom\", \"options\": [\"prod\", \"staging\"] }\n          ]\n        }\n      }\n    }\n\n\nOne template, 50 dashboards. Update once, apply everywhere.\n\nIf you want golden signal monitoring that sets itself up automatically, check out what we're building at Nova AI Ops.\n\n**Written by Dr. Samson Tanimawo**\nBSc · MSc · MBA · PhD\nFounder & CEO, Nova AI Ops. https://novaaiops.com",
  "title": "The Golden Signals: A Practical Implementation Guide"
}