{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreieoimyd4rms5agiwrjmcwap5jk67bfr6zih26tl3eeoyqz5gxkz24",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpcuhohdx2d2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreiatrd6go55ixdscqxyuuelkk64y7c7oj7flqbr5y6i3wwzi4feqsa"
},
"mimeType": "image/webp",
"size": 254144
},
"path": "/samson_tanimawo/kubernetes-observability-what-to-monitor-and-why-32bb",
"publishedAt": "2026-06-28T01:17:57.000Z",
"site": "https://dev.to",
"tags": [
"kubernetes",
"observability",
"monitoring",
"sre",
"Nova AI Ops",
"https://novaaiops.com"
],
"textContent": "## The Kubernetes Monitoring Maze\n\nKubernetes gives you a thousand metrics out of the box. Most teams monitor all of them and understand none of them.\n\nAfter running K8s in production for four years, here's what actually matters.\n\n## The Three Layers\n\nKubernetes observability has three distinct layers, and you need different strategies for each:\n\n\n\n Layer 1: Cluster Health (infrastructure)\n Layer 2: Workload Health (your apps)\n Layer 3: Application Performance (user experience)\n\n\n## Layer 1: Cluster Health\n\nThese are your \"is the platform working?\" metrics:\n\n\n\n critical_cluster_metrics:\n nodes:\n - node_ready_status # Are all nodes healthy?\n - node_cpu_utilization # Alert at 85%\n - node_memory_utilization # Alert at 90%\n - node_disk_pressure # Boolean alert\n - node_pid_pressure # Rarely fires, always critical\n\n control_plane:\n - apiserver_request_latency_p99 # Alert > 1s\n - etcd_disk_wal_fsync_duration # Alert > 100ms\n - scheduler_pending_pods # Alert if > 0 for 5min\n - controller_manager_queue_depth # Alert if growing\n\n\n**Pro tip:** Don't alert on individual node CPU. Alert on cluster-level capacity:\n\n\n\n # Alert when cluster is 80% utilized\n (\n sum(node_cpu_seconds_total{mode!=\"idle\"})\n /\n sum(node_cpu_seconds_total)\n ) > 0.80\n\n\n## Layer 2: Workload Health\n\nThis is where most teams get it wrong. 
They monitor pods instead of workloads.\n\n    critical_workload_metrics:\n      deployments:\n        - available_replicas < desired_replicas          # For > 5min\n        - deployment_generation != observed_generation   # Stuck rollout\n\n      pods:\n        - restart_count increasing       # CrashLoopBackOff detection\n        - container_oom_killed           # Memory limits too low\n        - pod_pending_duration > 2min    # Scheduling issues\n\n      hpa:\n        - current_replicas == max_replicas   # Scale ceiling hit\n        - cpu_utilization_vs_target          # Consistently above target\n\nThe most valuable alert I ever wrote:\n\n    # Detect pods stuck in CrashLoopBackOff.\n    # increase() yields the restart count over the window, so $value matches the description.\n    alert: PodCrashLooping\n    expr: increase(kube_pod_container_status_restarts_total[15m]) > 0\n    for: 15m\n    labels:\n      severity: warning\n    annotations:\n      summary: \"Pod {{ $labels.pod }} is crash-looping\"\n      description: \"{{ $labels.pod }} in {{ $labels.namespace }} has restarted about {{ $value | humanize }} times in the last 15 minutes\"\n\n## Layer 3: Application Performance\n\nThis is what your users actually care about:\n\n    application_metrics:\n      red_method:   # Rate, Errors, Duration\n        - request_rate_per_second\n        - error_rate_percentage    # Alert > 1%\n        - request_duration_p99     # Alert > 500ms\n\n      use_method:   # Utilization, Saturation, Errors\n        - cpu_request_vs_limit_ratio\n        - memory_request_vs_limit_ratio\n        - network_receive_bytes_rate\n\n## The Dashboard That Saves Us\n\nWe built a single \"K8s Health\" dashboard with four panels:\n\n 1. **Cluster capacity** — CPU/Memory/Disk utilization per node pool\n 2. **Workload status** — Table of all deployments with health status\n 3. **Error rates** — All services, sorted by error rate\n 4. **Recent events** — K8s events filtered to warnings and errors\n\nThis one dashboard answers 90% of \"is something wrong?\" questions.\n\n## Common Mistakes\n\n 1. **Monitoring pods instead of services** — Pods are ephemeral, services are what matter\n 2. **Not setting resource requests** — Without requests, your metrics are meaningless\n 3. 
**Alerting on resource usage instead of SLOs** — High CPU isn't a problem if latency is fine\n 4. **Ignoring the control plane** — An unhealthy API server affects everything\n\n\n\nIf you want unified Kubernetes observability without the complexity, check out what we're building at Nova AI Ops.\n\n**Written by Dr. Samson Tanimawo**\nBSc · MSc · MBA · PhD\nFounder & CEO, Nova AI Ops. https://novaaiops.com",
"title": "Kubernetes Observability: What to Monitor and Why"
}