Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreig7z2ax4iwb5qvirazs3ndcqvqj7ztbihod3kujzxlc72qqb6hxx4",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mowvq5bsjhe2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreic2eqjlrivegqlifoyhs6ynbxh7rqo3e4xleq5ad45k3d5uxdjtse"
    },
    "mimeType": "image/webp",
    "size": 70986
  },
  "path": "/oleksandr_kuryzhev_42873f/ai-anomaly-detection-in-grafana-3-mistakes-we-made-j9a",
  "publishedAt": "2026-06-23T07:03:01.000Z",
  "site": "https://dev.to",
  "tags": [
    "monitoring",
    "devops",
    "kuryzhev.cloud",
    "Prometheus HTTP API",
    "Grafana 10.2 alerting",
    "More Prometheus and Grafana monitoring patterns in production",
    "Kubernetes deployment strategies, CronJobs, and workload management",
    "Python automation scripts for DevOps workflows and tooling"
  ],
  "textContent": "_Originally published on kuryzhev.cloud_\n\nWe replaced 200 static Prometheus threshold alerts with an AI anomaly detection model — and spent the first month making things measurably worse before we figured out why. The model fired constantly, woke people up at 3am for non-issues, then went completely silent during a real incident. This is the honest account of what went wrong and what the working architecture actually looks like now.\n\n## Context — Why We Tried AI Anomaly Detection in the First Place\n\nOur alerting stack before this experiment was a graveyard of static thresholds. CPU above 80%? Alert. Memory above 75%? Alert. P95 latency above 500ms? Alert. Every one of those numbers was picked by a human, at a point in time, for a service that has since changed completely. The result was a Prometheus setup with roughly 200 alert rules that the on-call rotation had learned to mostly ignore. Alert fatigue was real and documented — we had a Slack channel called `#alerts-noise` that received more traffic than `#incidents`.\n\nThe incident that finally pushed us to act was a gradual memory leak in a Go microservice. The leak was slow — about 12MB per hour. It stayed under every static threshold for six full hours. No alert fired. Then the service hit the container memory limit, got OOM-killed, and the cascade took down three downstream services before we caught it. The postmortem was uncomfortable. We had all the metrics. Prometheus had scraped every data point. We just had no rule that would have caught a slow, sustained drift rather than a sharp spike.\n\nThe target stack we built toward: Grafana 10.2.3, Prometheus 2.48, and a Python-based anomaly model using Isolation Forest from scikit-learn 1.4.0, deployed as a sidecar service on Kubernetes 1.28. The model would score incoming metric streams and expose those scores back to Prometheus, where Grafana alert rules would evaluate them. Clean in theory. 
Painful in practice.\n\n## Mistake 1 — We Trusted the Model Out of the Box Without Baseline Training Data\n\nThe first version of the model was trained on two weeks of Prometheus metrics pulled via the Prometheus HTTP API. We ran the training during a quiet period — post-holiday, low traffic, no deployments. The model learned what \"normal\" looked like during the quietest two weeks of our entire year.\n\nMonday morning arrived. Traffic ramped up as users came back online. The model had never seen a Monday morning traffic pattern. It flagged every single ramp as an anomaly. We got 400 Grafana alerts in the first week. Most of them were garbage.\n\nThe subtler problem was the `contamination` parameter. We had it at `0.1`, the default in older scikit-learn releases (current releases default to `\"auto\"`). What that means in practice: the Isolation Forest sets its decision threshold so that roughly 10% of the training points are labeled as anomalies, regardless of whether your data actually contains 10% anomalies. In a healthy, stable service, you might have 0.5% genuinely anomalous points. Setting `contamination=0.1` forces the model to invent the other 9.5%. It will find them. It will call your Monday morning traffic an anomaly. It will call your weekly deployment window an anomaly. It will call a lot of things anomalies, because you told it to.\n\nThe fix took about three weeks to implement properly. We retrained on 90 days of data including at least two full deploy cycles and multiple traffic peaks. We tuned `contamination` down to `0.02` — roughly matching our observed real-incident rate. We added `hour_of_week` as an explicit feature dimension so the model could learn that Monday 9am is structurally different from Sunday 3am. The false-positive rate dropped by roughly 80% after retraining.\n\n**Watch out for:** the `contamination` setting. A carried-over `0.1` is almost certainly wrong for your workload. 
Always benchmark it against your actual historical incident rate before deploying to production.\n\nHere is the training script we use now, run weekly via a Kubernetes CronJob scheduled at `0 2 * * 0` (Sunday 2am UTC):\n\n\n    # anomaly_model/train.py\n    # Trains Isolation Forest on Prometheus metric data and serializes model + scaler\n    # Run via Kubernetes CronJob weekly; outputs versioned artifacts to /models/\n\n    import os\n    import joblib\n    import requests\n    import numpy as np\n    import pandas as pd\n    from datetime import datetime, timedelta, timezone\n    from sklearn.ensemble import IsolationForest\n    from sklearn.preprocessing import StandardScaler\n\n    PROMETHEUS_URL = os.getenv(\"PROMETHEUS_URL\", \"http://prometheus-svc:9090\")\n    MODEL_OUTPUT_DIR = os.getenv(\"MODEL_OUTPUT_DIR\", \"/models\")\n    LOOKBACK_DAYS = 90\n    STEP = \"60s\"\n    # Prometheus rejects query_range calls that return more than 11,000 points per\n    # series, so fetch in 7-day chunks (10,081 points at a 60s step).\n    CHUNK = timedelta(days=7)\n\n    METRICS = {\n        \"latency_p95\": 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))',\n        \"error_rate\":  'rate(http_requests_total{status=~\"5..\"}[5m])',\n        \"cpu_throttle\": 'rate(container_cpu_cfs_throttled_seconds_total[5m])',\n    }\n\n    def fetch_metric(query: str, start: datetime, end: datetime) -> pd.Series:\n        \"\"\"Query Prometheus range API in chunks and return a UTC time-indexed Series.\"\"\"\n        points = {}\n        chunk_start = start\n        while chunk_start < end:\n            chunk_end = min(chunk_start + CHUNK, end)\n            resp = requests.get(f\"{PROMETHEUS_URL}/api/v1/query_range\", params={\n                \"query\": query,\n                # start/end are naive UTC datetimes; mark them as UTC before converting\n                \"start\": chunk_start.replace(tzinfo=timezone.utc).timestamp(),\n                \"end\":   chunk_end.replace(tzinfo=timezone.utc).timestamp(),\n                \"step\":  STEP,\n            }, timeout=30)\n            resp.raise_for_status()\n            result = resp.json()[\"data\"][\"result\"]\n            if result:\n                for t, v in result[0][\"values\"]:  # [[timestamp, value], ...]\n                    # Keep the index as naive UTC so hour-based features are in UTC\n                    ts = datetime.fromtimestamp(float(t), tz=timezone.utc).replace(tzinfo=None)\n                    points[ts] = float(v)\n            chunk_start = chunk_end\n        if not points:\n            raise ValueError(f\"No data returned for query: {query}\")\n        return pd.Series(points, name=query[:40]).sort_index()\n\n    def 
build_feature_matrix(end: datetime) -> pd.DataFrame:\n        \"\"\"Fetch all metrics and assemble feature matrix with engineered features.\"\"\"\n        start = end - timedelta(days=LOOKBACK_DAYS)\n        frames = {}\n        for name, query in METRICS.items():\n            frames[name] = fetch_metric(query, start, end)\n        df = pd.DataFrame(frames).dropna()\n\n        # Engineered features: hour-of-week captures weekly seasonality\n        df[\"hour_of_week\"] = df.index.dayofweek * 24 + df.index.hour\n\n        # Binary flag: mark known deploy windows (weekdays 10-12 UTC)\n        df[\"is_deploy_window\"] = (\n            (df.index.dayofweek < 5) &\n            (df.index.hour >= 10) &\n            (df.index.hour < 12)\n        ).astype(int)\n\n        return df\n\n    def train_and_save():\n        end = datetime.utcnow()\n        df = build_feature_matrix(end)\n\n        scaler = StandardScaler()\n        X_scaled = scaler.fit_transform(df.values)\n\n        model = IsolationForest(\n            n_estimators=200,\n            contamination=0.02,   # tuned: ~2% expected anomaly rate in production\n            random_state=42,       # pinned to prevent score drift on rebuild\n            n_jobs=-1,\n        )\n        model.fit(X_scaled)\n\n        # Version tag: YYYYMMDD for traceability\n        version = end.strftime(\"%Y%m%d\")\n        joblib.dump(model,  f\"{MODEL_OUTPUT_DIR}/isolation_forest_{version}.pkl\")\n        joblib.dump(scaler, f\"{MODEL_OUTPUT_DIR}/scaler_{version}.pkl\")\n\n        # Symlink \"latest\" for the serving layer to pick up without restart\n        for artifact, name in [(f\"isolation_forest_{version}.pkl\", \"model_latest.pkl\"),\n                               (f\"scaler_{version}.pkl\",           \"scaler_latest.pkl\")]:\n            link = os.path.join(MODEL_OUTPUT_DIR, name)\n            if os.path.islink(link):\n                os.remove(link)\n            os.symlink(os.path.join(MODEL_OUTPUT_DIR, artifact), link)\n\n     
   print(f\"[train] Model version {version} saved. Samples trained: {len(df)}\")\n\n    if __name__ == \"__main__\":\n        train_and_save()\n\n\nOne thing worth noting: the `StandardScaler` fitted on training data must be serialized alongside the model using `joblib.dump` and loaded together at inference time. Forgetting to save the scaler is one of the most common sources of silent score corruption I've seen. The model loads fine, inference runs without errors, and the scores are completely wrong. There is no exception thrown. You will not know unless you are actively monitoring the score distribution.\n\n## Mistake 2 — We Wired the Model Output Directly into PagerDuty Without a Confidence Gate\n\nEven after fixing the training data problem, the alerting integration was still a disaster. The first version piped raw anomaly scores directly into a Grafana Alert Rule with a simple threshold: if `anomaly_score > 0.6`, fire. No smoothing. No consecutive-breach requirement. No pending state window.\n\nA single anomalous data point — one 60-second scrape interval — could trigger a full PagerDuty incident. We had a case where a transient network blip caused a 30-second latency spike. The anomaly score spiked to 0.72 for exactly one evaluation cycle. PagerDuty fired. Someone got woken up. By the time they opened their laptop, the score was back at 0.15 and every metric was green. That is not a monitoring system. That is a random number generator with a pager.\n\nThe problem was that we had omitted the `for:` duration in the Grafana alert rule YAML. In Grafana 10.2 alerting, the `for:` field controls how long a condition must be continuously true before the alert transitions from `Pending` to `Firing`. Without it, the alert fires on the first breach. 
We found through painful trial and error that `for: 3m` — meaning three consecutive one-minute evaluation cycles above the threshold — was the minimum viable duration to suppress transient spikes without masking real incidents.\n\nWe also added a Prometheus Recording Rule to pre-aggregate the raw score into a 5-minute rolling average before Grafana ever evaluates it. The provisioned Grafana rules that consume the pre-aggregated series:\n\n\n    # /etc/grafana/provisioning/alerting/anomaly-rules.yaml\n    # Grafana 10.2 alert provisioning — two-stage anomaly alert with severity routing\n    # Apply: restart Grafana pod or POST /api/admin/provisioning/alerting/reload\n\n    apiVersion: 1\n\n    groups:\n      - orgId: 1\n        name: anomaly_detection\n        folder: AI Monitoring\n        interval: 1m   # evaluation cadence — matches Prometheus scrape interval\n\n        rules:\n          # Stage 1 — WARNING: smoothed score crosses lower threshold\n          - uid: anomaly-warn-001\n            title: \"Anomaly Score Warning — Elevated\"\n            condition: C\n            data:\n              - refId: A\n                datasourceUid: prometheus-ds\n                model:\n                  expr: job:anomaly_score:avg5m   # recording rule pre-aggregated value\n                  intervalMs: 60000\n                  maxDataPoints: 43200\n              - refId: C\n                datasourceUid: \"__expr__\"\n                model:\n                  type: threshold\n                  conditions:\n                    - evaluator:\n                        params: [0.65]\n                        type: gt\n                      query:\n                        params: [A]\n            for: 3m    # must stay above threshold for 3 consecutive evals before firing\n            labels:\n              severity: warning\n              team: platform\n            annotations:\n              summary: \"Anomaly score elevated on {{ $labels.job }}\"\n              description: >\n                Rolling 5m anomaly score is {{ $values.A.Value | 
printf \"%.3f\" }},\n                above warning threshold 0.65. Check Grafana dashboard: AI Anomaly Overview.\n\n          # Stage 2 — CRITICAL: high-confidence anomaly, pages on-call\n          - uid: anomaly-crit-001\n            title: \"Anomaly Score Critical — Page On-Call\"\n            condition: C\n            data:\n              - refId: A\n                datasourceUid: prometheus-ds\n                model:\n                  expr: job:anomaly_score:avg5m\n                  intervalMs: 60000\n                  maxDataPoints: 43200\n              - refId: C\n                datasourceUid: \"__expr__\"\n                model:\n                  type: threshold\n                  conditions:\n                    - evaluator:\n                        params: [0.85]\n                        type: gt\n                      query:\n                        params: [A]\n            for: 3m\n            labels:\n              severity: critical\n              team: platform\n            annotations:\n              summary: \"HIGH CONFIDENCE anomaly on {{ $labels.job }} — investigate now\"\n              description: >\n                Score {{ $values.A.Value | printf \"%.3f\" }} exceeds critical threshold 0.85.\n                Model version: {{ $labels.model_version }}.\n\n\nThe two-stage severity split — warning at `0.65` routed to Slack, critical at `0.85` routed to PagerDuty — was a deliberate choice. It gives the team visibility into developing situations without immediately escalating to an incident. In practice, the warning tier catches about 70% of real anomalies before they reach the critical threshold, giving engineers time to investigate during business hours rather than at 3am.\n\n**Watch out for:** using raw `anomaly_score` directly as the Grafana alert expression instead of a smoothed recording rule. This causes alert flapping and burns your PagerDuty incident quota faster than almost anything else. 
Always pre-aggregate.\n\n## Mistake 3 — The Model Serving Layer Had No Versioning, Rollback, or Drift Detection\n\nThis one was the most expensive mistake. We treated the ML model like a static config file. It lived in a Docker image tagged `latest`. No version history. No rollback procedure. No metrics about the model's own behavior.\n\nA routine base image rebuild bumped scikit-learn from 1.3.x to 1.4.0. We did not pin the version in `requirements.txt`. The change was silent — no build error, no test failure, no deployment alert. What changed internally was Isolation Forest's random seed behavior and tree-building logic. Score distributions shifted by approximately 0.08 across the board. Every metric that previously scored in the `[0.5, 0.7]` range now scored in the `[0.0, 0.4]` range. No alerts fired. For eleven days.\n\nDuring that eleven-day window, we had a real incident: Redis connection pool exhaustion caused roughly 40 minutes of degraded checkout latency in production. The anomaly model was running the entire time. It saw the metrics. It scored everything below 0.3. Nobody got paged. We found out about the degradation through a customer support ticket.\n\nThe insidious part is that there was no error message. The model did not crash. The serving endpoint returned HTTP 200 on every inference request. The scores were just wrong, and we had no way to know that without observing the score distribution over time. The silent failure mode for model drift is: scores return, but they are all low. You will not see an exception. You will see nothing, until a real incident goes undetected.\n\nThe fix required treating the model as a first-class production service, not a background utility. We now pin `scikit-learn==1.4.0` explicitly in `requirements.txt`. Docker images are tagged with the convention `anomaly-model:YYYYMMDD-GITHASH` and stored in ECR. 
The previous version is always retained for instant rollback via `kubectl set image deployment/anomaly-model anomaly-model=<ECR_URI>:<previous_tag>`. And the model service now exposes a `/metrics` endpoint that Prometheus scrapes, publishing `anomaly_model_score_p95`, `anomaly_rate_5m`, and `model_version_timestamp`.\n\nIf `anomaly_rate_5m` drops below 0.005 for more than 30 minutes during peak traffic hours, we get a separate alert: \"model may be silently underscoring — check for drift.\" That alert has fired twice since we added it. Both times, something real was wrong with the model configuration.\n\n## What We Do Differently Now — The Architecture That Actually Works\n\nAfter three painful rounds of iteration, the current setup is stable. Here is what changed at each layer.\n\n**Training pipeline:** The model trains on a rolling 90-day window, which at 60-second scrape intervals gives roughly 129,600 data points per metric series (about 1MB per feature as 64-bit floats), which is entirely manageable. Features include P95 latency, error rate, CPU throttle ratio, `hour_of_week`, and a binary `is_deploy_window` flag. Parameters are fixed: `n_estimators=200`, `contamination=0.02`, `random_state=42`. The `random_state` pin is not optional — it ensures that rebuilding the image with the same data produces the same model, which makes debugging score changes tractable. Retraining runs weekly via a Kubernetes CronJob at `0 2 * * 0`.\n\n**Alert evaluation:** A Prometheus Recording Rule pre-aggregates the raw score using `avg_over_time(anomaly_score[5m])` into `job:anomaly_score:avg5m`. The Grafana alert rule evaluates this pre-aggregated metric, not the raw score. The `for: 3m` pending window filters transient spikes. Warning at `0.65` goes to Slack. Critical at `0.85` goes to PagerDuty. 
The Grafana notification template uses `{{ $values.A.Value | printf \"%.3f\" }}` to surface the actual score in the alert message, which makes triage significantly faster.\n\n**Model observability:** The Flask 3.0 / Gunicorn 21.2 serving app on port `8080` exposes a `/metrics` endpoint scraped by Prometheus. We track score distribution, anomaly rate, and model version timestamp as first-class metrics on a dedicated Grafana dashboard. A separate alert fires if the anomaly rate drops to near zero during expected-traffic periods — the canary for silent model drift.\n\n**Security note:** The `/metrics` endpoint must not be exposed outside the cluster. It leaks service topology and internal metric names. We enforce a Kubernetes NetworkPolicy that restricts ingress to the `/metrics` path to only the Prometheus scraper pod. This is easy to forget when you are moving fast and should be in your deployment checklist from day one.\n\nAI anomaly detection in Grafana is genuinely useful when the model is treated as a production system with the same rigor as any other service: versioned artifacts, observable internals, tested rollback paths, and alert rules with appropriate confidence gates. Skip any one of those and you will either wake people up constantly or miss real incidents silently. We learned all of this the hard way. You don't have to.\n\nFor more on building reliable monitoring pipelines, see the rest of the DevOps_DayS series at kuryzhev.cloud.\n\n## Related\n\n  * More Prometheus and Grafana monitoring patterns in production\n  * Kubernetes deployment strategies, CronJobs, and workload management\n  * Python automation scripts for DevOps workflows and tooling\n\n",
  "title": "AI Anomaly Detection in Grafana: 3 Mistakes We Made"
}