{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreidc6sll5zckhzof5rbiixt7g4e34kzybac46e4p6ke2ol2lzj4xua",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpe4pvunej42"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreiatrd6go55ixdscqxyuuelkk64y7c7oj7flqbr5y6i3wwzi4feqsa"
},
"mimeType": "image/webp",
"size": 254144
},
"path": "/samson_tanimawo/the-golden-signals-a-practical-implementation-guide-4n8m",
"publishedAt": "2026-06-28T13:18:06.000Z",
"site": "https://dev.to",
"tags": [
"sre",
"monitoring",
"observability",
"devops",
"Nova AI Ops",
"https://novaaiops.com",
"@app.middleware"
],
"textContent": "## Four Metrics to Rule Them All\n\nGoogle's SRE book introduced the four golden signals: Latency, Traffic, Errors, and Saturation. The concept is simple, but I've seen teams struggle with the implementation.\n\nHere's a practical guide from someone who's implemented them across 50+ services.\n\n## Signal 1: Latency\n\nNot all latency is equal. Track successful requests and failed requests separately: errors often fail fast, and mixing them in makes latency look better than it is.\n\n    # Bad: average latency hides the slow tail\n    latency = total_request_time / total_requests\n\n    # Good: percentile latency, separated by status\n    # (assumes `app` is a FastAPI/Starlette application)\n    import time\n\n    from prometheus_client import Histogram\n\n    REQUEST_LATENCY = Histogram(\n        'http_request_duration_seconds',\n        'Request latency',\n        ['method', 'endpoint', 'status_class'],\n        buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]\n    )\n\n    @app.middleware(\"http\")\n    async def track_latency(request, call_next):\n        start = time.perf_counter()\n        response = await call_next(request)\n        duration = time.perf_counter() - start\n        status_class = f\"{response.status_code // 100}xx\"\n        REQUEST_LATENCY.labels(\n            method=request.method,\n            # Raw paths with IDs inflate label cardinality; prefer the route template.\n            endpoint=request.url.path,\n            status_class=status_class\n        ).observe(duration)\n        return response\n\nAlert on p99, not p50. 
Your happiest users don't need help.\n\n    - alert: HighLatencyP99\n      expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5\n      for: 5m\n      labels:\n        severity: warning\n\n## Signal 2: Traffic\n\nTraffic tells you \"is this normal?\" It's the context for every other signal.\n\n    # Current request rate\n    rate(http_requests_total[5m])\n\n    # Compare to same time last week\n    rate(http_requests_total[5m])\n    /\n    rate(http_requests_total[5m] offset 7d)\n\n    # Alert on sudden drops (possible outage nobody noticed)\n    - alert: TrafficDrop\n      expr: >\n        rate(http_requests_total[5m])\n        <\n        (rate(http_requests_total[5m] offset 1h) * 0.5)\n      for: 10m\n      annotations:\n        summary: \"Traffic dropped >50% compared to 1 hour ago\"\n\nTraffic drops are often more concerning than traffic spikes.\n\n## Signal 3: Errors\n\nTrack error rate as a percentage, not as an absolute count:\n\n    # Error rate percentage\n    (\n      sum(rate(http_requests_total{status=~\"5..\"}[5m]))\n      /\n      sum(rate(http_requests_total[5m]))\n    ) * 100\n\nBut also track error types separately:\n\n    error_categories:\n      - 5xx: \"Server errors (our fault)\"\n      - 4xx_excluding_404: \"Client errors (possible API issue)\"\n      - timeout: \"Request timeouts\"\n      - circuit_breaker: \"Dependency failures\"\n\n## Signal 4: Saturation\n\nThe most underrated signal. Saturation answers: \"how close are we to full?\"\n\n    # CPU saturation: cores used / cores allowed\n    # (a counter needs rate(); quota / period converts the CFS quota to cores;\n    #  the pod and container labels assume Kubernetes cAdvisor metrics)\n    sum by (pod, container) (rate(container_cpu_usage_seconds_total[5m]))\n    /\n    sum by (pod, container) (container_spec_cpu_quota / container_spec_cpu_period)\n\n    # Memory saturation\n    container_memory_working_set_bytes / container_spec_memory_limit_bytes\n\n    # Connection pool saturation\n    active_connections / max_connections\n\n    # Queue saturation (the one everyone forgets)\n    message_queue_depth / message_queue_capacity\n\nAlert before you hit 100%. 
I use 80% as the threshold for warning and 95% for critical.\n\n## Putting It All Together\n\nEvery service gets a standard dashboard with four rows:\n\n    Row 1: Latency     [p50] [p90] [p99] [error latency]\n    Row 2: Traffic     [rate] [vs last week] [by endpoint]\n    Row 3: Errors      [rate %] [by type] [by endpoint]\n    Row 4: Saturation  [CPU] [Memory] [Connections] [Queue]\n\nThis fits on one screen. No scrolling. Any engineer can assess service health in 10 seconds.\n\n## The Anti-Pattern\n\nDon't build a golden signals dashboard per service manually. Template it:\n\n    {\n      \"dashboard\": {\n        \"title\": \"Golden Signals: {{ service_name }}\",\n        \"templating\": {\n          \"list\": [\n            { \"name\": \"service\", \"type\": \"query\" },\n            { \"name\": \"environment\", \"type\": \"custom\", \"options\": [\"prod\", \"staging\"] }\n          ]\n        }\n      }\n    }\n\nOne template, 50 dashboards. Update once, apply everywhere.\n\nIf you want golden signal monitoring that sets itself up automatically, check out what we're building at Nova AI Ops.\n\n**Written by Dr. Samson Tanimawo**\nBSc · MSc · MBA · PhD\nFounder & CEO, Nova AI Ops. https://novaaiops.com",
"title": "The Golden Signals: A Practical Implementation Guide"
}