Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreieoimyd4rms5agiwrjmcwap5jk67bfr6zih26tl3eeoyqz5gxkz24",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpcuhohdx2d2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreiatrd6go55ixdscqxyuuelkk64y7c7oj7flqbr5y6i3wwzi4feqsa"
    },
    "mimeType": "image/webp",
    "size": 254144
  },
  "path": "/samson_tanimawo/kubernetes-observability-what-to-monitor-and-why-32bb",
  "publishedAt": "2026-06-28T01:17:57.000Z",
  "site": "https://dev.to",
  "tags": [
    "kubernetes",
    "observability",
    "monitoring",
    "sre",
    "Nova AI Ops",
    "https://novaaiops.com"
  ],
  "textContent": "##  The Kubernetes Monitoring Maze\n\nKubernetes gives you a thousand metrics out of the box. Most teams monitor all of them and understand none of them.\n\nAfter running K8s in production for four years, here's what actually matters.\n\n##  The Three Layers\n\nKubernetes observability has three distinct layers, and you need different strategies for each:\n\n\n\n    Layer 1: Cluster Health (infrastructure)\n    Layer 2: Workload Health (your apps)\n    Layer 3: Application Performance (user experience)\n\n\n##  Layer 1: Cluster Health\n\nThese are your \"is the platform working?\" metrics:\n\n\n\n    critical_cluster_metrics:\n      nodes:\n        - node_ready_status        # Are all nodes healthy?\n        - node_cpu_utilization     # Alert at 85%\n        - node_memory_utilization  # Alert at 90%\n        - node_disk_pressure       # Boolean alert\n        - node_pid_pressure        # Rarely fires, always critical\n\n      control_plane:\n        - apiserver_request_latency_p99  # Alert > 1s\n        - etcd_disk_wal_fsync_duration   # Alert > 100ms\n        - scheduler_pending_pods         # Alert if > 0 for 5min\n        - controller_manager_queue_depth # Alert if growing\n\n\n**Pro tip:** Don't alert on individual node CPU. Alert on cluster-level capacity:\n\n\n\n    # Alert when cluster is 80% utilized\n    (\n      sum(node_cpu_seconds_total{mode!=\"idle\"})\n      /\n      sum(node_cpu_seconds_total)\n    ) > 0.80\n\n\n##  Layer 2: Workload Health\n\nThis is where most teams get it wrong. 
They monitor pods instead of workloads.\n\n\n\n    critical_workload_metrics:\n      deployments:\n        - available_replicas < desired_replicas  # For > 5min\n        - deployment_generation != observed_generation  # Stuck rollout\n\n      pods:\n        - restart_count increasing       # CrashLoopBackOff detection\n        - container_oom_killed            # Memory limits too low\n        - pod_pending_duration > 2min     # Scheduling issues\n\n      hpa:\n        - current_replicas == max_replicas  # Scale ceiling hit\n        - cpu_utilization_vs_target         # Consistently above target\n\n\nThe most valuable alert I ever wrote:\n\n\n\n    # Detect pods stuck in CrashLoopBackOff\n    alert: PodCrashLooping\n    expr: increase(kube_pod_container_status_restarts_total[15m]) > 0\n    for: 15m\n    labels:\n      severity: warning\n    annotations:\n      summary: \"Pod {{ $labels.pod }} is crash-looping\"\n      description: \"{{ $labels.pod }} in {{ $labels.namespace }} has restarted {{ $value }} times in 15 minutes\"\n\n\n##  Layer 3: Application Performance\n\nThis is what your users actually care about:\n\n\n\n    application_metrics:\n      red_method:  # Rate, Errors, Duration\n        - request_rate_per_second\n        - error_rate_percentage        # Alert > 1%\n        - request_duration_p99         # Alert > 500ms\n\n      use_method:  # Utilization, Saturation, Errors\n        - cpu_request_vs_limit_ratio\n        - memory_request_vs_limit_ratio\n        - network_receive_bytes_rate\n\n\n##  The Dashboard That Saves Us\n\nWe built a single \"K8s Health\" dashboard with four panels:\n\n  1. **Cluster capacity** — CPU/Memory/Disk utilization per node pool\n  2. **Workload status** — Table of all deployments with health status\n  3. **Error rates** — All services, sorted by error rate\n  4. 
**Recent events** — K8s events filtered to warnings and errors\n\n\n\nThis one dashboard answers 90% of \"is something wrong?\" questions.\n\n##  Common Mistakes\n\n  1. **Monitoring pods instead of services** — Pods are ephemeral, services are what matter\n  2. **Not setting resource requests** — Without requests, your metrics are meaningless\n  3. **Alerting on resource usage instead of SLOs** — High CPU isn't a problem if latency is fine\n  4. **Ignoring the control plane** — An unhealthy API server affects everything\n\n\n\nIf you want unified Kubernetes observability without the complexity, check out what we're building at Nova AI Ops.\n\n**Written by Dr. Samson Tanimawo**\nBSc · MSc · MBA · PhD\nFounder & CEO, Nova AI Ops. https://novaaiops.com",
  "title": "Kubernetes Observability: What to Monitor and Why"
}