Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidloyo6pb4z7zvmjncot7mt7dmw32w4tmkkssarh3qzgzg5ppaxyu",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mp346uw4cih2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreigvet6obwk6bdcql5hvvwju4a5mfxdaynlrcgj75tnqv7o2mjjhrm"
    },
    "mimeType": "image/webp",
    "size": 94474
  },
  "path": "/neko1313_4/how-much-does-context-cost-an-ai-coding-agent-grep-vs-graph-vs-lsp-measured-across-936-runs-33m8",
  "publishedAt": "2026-06-24T23:10:19.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "llm",
    "python",
    "opensource",
    "last post",
    "the graphlens post",
    "https://github.com/Neko1313/agent-context-bench",
    "graphlens",
    "docs",
    "@modelcontextprotocol"
  ],
  "textContent": "In my last post I described **graphlens** — what it does, how it works — and along the way I casually claimed that an agent \"burns tokens grepping around a repo.\" I gave exactly **zero** numbers to back that up.\n\nThis post fixes that. Here are the measurements, the data, and a reproducible harness. Spoiler: the conclusion is not the one I expected going in, and that's the interesting part.\n\n##  TL;DR\n\nI took **one** agent (Claude Code), changed **exactly one thing** — which MCP server feeds it code context — and ran it over 26 tasks on `apache/superset`. Four \"arms\": `filesystem` (grep + read), `graphlens` (structural graph), `serena` (LSP), and `codegraph`. Three models (haiku / sonnet / opus), three seeds — **936 runs**.\n\nThe headline: **the answer flips depending on the kind of task.**\n\n  * On simple \"where is X defined / what does X inherit from\" lookups, all four tools are **tied on accuracy**. The only difference is cost (~3×). graphlens is unremarkable here.\n  * On \"estimate the blast radius / find every override / disambiguate an overloaded name\" tasks, the tools **separate hard** : grep collapses (0.71 accuracy, only 83% of runs even finish, and the ones that do cost **6–24× more**), while the structural tools stay cheap and accurate.\n\n\n\nIf I'd only measured the easy tasks, I'd have written \"you don't need a graph, grep is fine.\" If only the hard ones, \"you don't need grep, get a graph.\" The truth sits in the middle, and it's about **what work you hand the agent.**\n\n##  The business case we're actually measuring\n\nPicture a familiar situation. You have a large project: hundreds of thousands of lines, a Python backend, a TypeScript front end, legacy code you're scared to touch. You wire an AI agent into it — for review, refactoring, answering questions like \"what breaks if I change this method's signature?\"\n\nThe agent can't see the whole repo at once. 
Something has to feed it context: which functions live where, who calls whom, what inherits from what. And here's an **architectural decision with a price tag** : what exactly do you feed it?\n\nThere are basically four classes of answer:\n\n  * **Hand it grep + read** — let it search by text and open files. Zero infrastructure, works everywhere.\n  * **Build a structural code graph** (graphlens) — entity nodes, typed edges, exact answers to \"who calls this.\"\n  * **Stand up an LSP** (serena over a language server) — what your IDE already runs on.\n  * **Use an off-the-shelf code-graph product** (codegraph).\n\n\n\nEach option costs money (tokens), time (latency), and risk (the agent gives up and hits a turn cap). `apache/superset` is an almost perfect stand-in for this case: ~400k LOC, Python + TypeScript, an `/api/v1/...` boundary between front and back. A big polyglot project — exactly when this question is worth asking.\n\nSo how much does each option cost? Let's measure.\n\n##  Experiment design: change one variable\n\nThe whole methodology rests on one principle: **fix everything except one thing.** Model, system prompt, settings, task set — constants. Only the context-providing MCP server changes. Then any difference in the numbers is the contribution of that tool, not a config accident.\n\nNo tool is designated \"the baseline to beat.\" All four are measured on equal footing, and the numbers rank them.\n\n###  The four arms\n\nArm | Context provider (MCP server) | Indexing step\n---|---|---\n`filesystem` |  `@modelcontextprotocol/server-filesystem` (read_file + grep) | none\n`graphlens` | graphlens graph over MCP | `graphlens analyze`\n`serena` | Serena (LSP) | LSP workspace warm-up\n`codegraph` | a graph-based competitor | `codegraph init`\n\nOne detail that matters for fairness: **Claude Code's built-in tools (Read / Grep / Bash, etc.) 
are disabled.** If you don't take them away, the agent ignores the MCP server and falls back to its usual path — and you'd be measuring the wrong thing. So the harness runs `claude -p` in a clean room: a fresh `CLAUDE_CONFIG_DIR` with only subscription credentials (no hooks, plugins, skills, memory), `--strict-mcp-config` (only this arm's server is visible), `--disallowedTools` on every built-in (an explicit _deny_, because in headless mode an allow-list alone forbids nothing), and `--allowedTools mcp__<server>` to auto-approve the one server.\n\n###  The second axis: models\n\nIn parallel I varied the model answering the question:\n\nKey | model id\n---|---\n`haiku` | `claude-haiku-4-5`\n`sonnet` | `claude-sonnet-4-6`\n`opus` | `claude-opus-4-8`\n\nWhy there is a second axis becomes clear near the end: **the optimal tool depends on which model you picked.** That's probably the least obvious finding in the whole thing.\n\nTotal: 4 arms × 3 models × 26 tasks × 3 seeds = **936 runs** (on Claude Code 2.1.187).\n\n##  What counts as an honest measurement\n\nBenchmarks are easy to bend toward the conclusion you want. So the rules are fixed up front — without them the numbers aren't trustworthy:\n\n  * **Gold answers are hand-verified** against source at tag `6.0.0` (every task carries a `file:line` reference). Crucially, **gold is not generated by any tool under test** (not ty, not pyright, not graphlens itself); otherwise the comparison is biased toward whichever tool produced the labels. Set-task gold is checked with an independent oracle: Python's `ast`.\n  * **The \"naive\" arm has hands.** `filesystem` is grep + read, not \"an agent with no tools.\" Naive ≠ toolless.\n  * **Index cost is measured separately, once.** grep pays nothing to index; a graph amortizes. You can't mix those currencies.\n  * **There is no determinism.** `temperature=0` does not make these models deterministic. 
So 3 seeds, and the report shows **the median, not the mean.**\n  * **Versions are recorded** — models and every MCP server — plus a price snapshot and date.\n  * **`cost_usd` is an API-equivalent, not your bill.** The subscription is flat-rate, so `cost_usd` (emitted by the CLI) is what the same tokens _would_ cost via the API. It's **not** your actual invoice, but it is a **correct relative $/task metric** for comparing arms.\n  * **Use a tool, or it doesn't count.** The system prompt forbids answering from memory; a run with zero tool calls is retried (and a stubborn refusal is tagged `__NO_TOOLS__`). Answering \"from memory\" about a well-known repo wouldn't measure the context provider.\n\n\n\nAnd separately: **failure counts as accuracy 0.** If grep hits the 50-turn cap and never produces an answer, that's not \"no data\" — it's \"the tool didn't get there within budget.\" That's how it's scored.\n\n##  The tasks: two regimes, and why you can't blend them\n\n26 tasks split into two classes.\n\n**SIMPLE — 20 pinpoint lookups** (\"where is X defined / what does X inherit from\"). One-point answers, checked by substring:\n\nKind | # | What it probes\n---|---|---\n`where_defined` | 7 | Python class → defining file\n`inherits_from` | 5 | Python class → base class\n`abstract_methods` | 1 | ABC → its abstract methods\n`ts_where_defined` | 1 | TS hook → defining file\n`ts_route_call` | 4 |  `/api/v1/...` route → the TS hook that calls it\n`xlang_link` | 2 | TS consumer → Python handler across the API boundary\n\n**HARD — 6 blast-radius and disambiguation tasks.** This is the regime where structure and semantics _should_ beat text search — and which pinpoint lookups simply can't measure:\n\nKind | # | What it probes | Scoring\n---|---|---|---\n`disambiguate` | 2 | an ambiguous bare method name (e.g. 
`cache_key`, defined on many classes) → _the_ right class | substring\n`overrides_count` | 2 | the full set of subclasses overriding a base method | **set F1**\n`impact_set` | 2 | every file calling a given method (the blast radius) | **set F1**\n\nSet tasks are scored by F1, the harmonic mean of precision and recall: it rewards recall (find them all) and penalizes low precision (text search loves to dump every occurrence of `.get_indexes(`). Gold sets are kept small (3–5 elements, with one of about 17) so they can be exhaustively checked by hand.\n\n###  Why I stratify instead of averaging\n\nThe set is **deliberately unbalanced** — 20 simple vs 6 hard. A single blended average would be dominated by the easy tasks and would **hide** exactly the difference the hard ones expose. So I report each regime **separately, and never mix them.**\n\nAnd no, I deliberately don't \"balance to 50/50\" by dropping simple tasks. That would throw away data and statistical power, and open the door to cherry-picking. Stratification neutralizes the skew **without discarding data**. (General principle: if regimes give different answers, it's more honest to show both than to bury the conflict under an average.)\n\n##  Results\n\n###  SIMPLE — 20 pinpoint lookups\n\nTool | accuracy | complete | tokens | calls | $/task | sec\n---|---|---|---|---|---|---\nfilesystem | 0.97 | 100% | 1780 | 10 | $0.063 | 43\ngraphlens | 0.98 | 100% | 690 | 3 | $0.038 | 13\nserena | 0.99 | 100% | 402 | 3 | $0.031 | 20\ncodegraph | 0.99 | 100% | 372 | 1 | $0.022 | 10\n\nAccuracy is a **tie** (formally: Friedman χ²=0.40, not significant). The tools differ only on cost — a ~3× spread — and the terse ones win. 
**graphlens is unremarkable here** — a solid mid-pack.\n\nThis is exactly the story a benchmark that _only_ measured pinpoint lookups would tell: \"structural tools are nice, but grep nearly keeps up, and codegraph gives the cheapest answer.\" And it would be an **incomplete** truth.\n\n###  HARD — 6 blast-radius and disambiguation tasks\n\nTool | accuracy | complete | tokens | calls | $/task | sec\n---|---|---|---|---|---|---\nfilesystem | 0.71 | 83% | 12596 | 27 | $0.424 | 165\ngraphlens | 0.84 | 100% | 748 | 1 | $0.018 | 9\nserena | 0.85 | 98% | 1368 | 5 | $0.065 | 29\ncodegraph | 0.93 | 100% | 1114 | 2 | $0.036 | 16\n\nNow the tools **separate.**\n\n**grep collapses.** Lowest accuracy (0.71), only 83% of runs finish (the rest hit the 50-turn cap), and the ones that finish cost **6–24× more** ($0.42 vs $0.018–0.065) and take **6–18× longer** (~165s vs 9–29s). Text search drowns in noise when the question is \"every call to this\" or \"which of a dozen identically-named methods.\"\n\nAnd the key bit: **graphlens — the mid-pack tool on easy tasks — is here the cheapest ($0.018) and fastest (9s).** Its semantic graph finally pays off: one call instead of twenty-seven. The most _accurate_ tool is codegraph (0.93). serena is competitive (0.85).\n\nSo the same graphlens that looked unremarkable on pinpoint lookups becomes the most economical the moment the work is real — blast radius, refactoring. The ranking **inverts** between regimes.\n\n> Fairness note. MCP _resources_ are disabled for all arms. graphlens was the only server exposing resources, and in an early run the agent wandered into enumerating them and inflated cost ~24% until I denied them. All numbers above are from the clean re-run.\n\n##  Where the money goes: the mechanism is round-trips\n\nThe cost difference is mostly **how many times the agent calls the tool** , which follows from how a server slices its primitives.\n\nOn a simple \"symbol → file\" (`where_defined`), one call is enough for everyone. 
The gap opens on **relationship queries** — inheritance, route → handler, cross-language links. There, `graphlens` chains fine-grained primitives (`find` → `neighbors` → `references`), while `codegraph` packs \"source + call paths in one shot\" (`explore` / `node`).\n\nThis isn't a difference in _what the graph knows_ — graphs know roughly the same things. It's a difference in API granularity: fewer round-trips → cheaper and faster. That's why codegraph has the efficiency edge on simple tasks, and why grep bankrupts itself on hard ones — it makes 27 round-trips where the graph needs one or two.\n\n##  Model × tool interaction: the ranking drifts with model price\n\nThis is the least obvious part. Take median $/task (across both regimes) broken down by model:\n\nTool | haiku | sonnet | opus\n---|---|---|---\nfilesystem | $0.053 | $0.080 | $0.087\ngraphlens | $0.020 | $0.041 | $0.046\nserena | $0.026 | $0.033 | $0.042\ncodegraph | $0.023 | $0.041 | $0.031\n\nCheapest-first ranking **within each model**:\n\n  * **haiku:** graphlens \\$0.020 < codegraph \\$0.023 < serena \\$0.026 < filesystem \\$0.053\n  * **sonnet:** serena \\$0.033 < graphlens \\$0.041 < codegraph \\$0.041 < filesystem \\$0.080\n  * **opus:** codegraph \\$0.031 < serena \\$0.042 < graphlens \\$0.046 < filesystem \\$0.087\n\nWatch what happens to graphlens. On **haiku it's the cheapest of all.** On **opus it becomes the most expensive of the structural tools** (still cheaper than grep, though).\n\nThe mechanism: graphlens results are **token-heavy** — graph neighborhoods, reference lists. On a cheap model that verbose context is nearly free; on an expensive model such as opus the same tokens are priced far higher, and verbosity hits the wallet. 
**serena and codegraph stay cheap on any model** because they return pinpoint results — they're robust to model choice; graphlens isn't.\n\nWhich gives the most valuable takeaway of the lot: **a cheap model on a structural tool beats an expensive model on grep.** codegraph + haiku (~$0.023, accuracy ~0.99) beats filesystem + opus (~$0.087, accuracy 0.93) on every axis at once.\n\n##  The hypothesis that didn't hold\n\nI planted the two `xlang_link` tasks as a stress test: a TS call resolves to a Python handler across the `/api/v1/...` boundary, and I was sure single-language tools would trip on it.\n\n**They didn't.** Every arm, grep included, solved both cross-language tasks. The agent steps across the boundary itself, regardless of the context provider. On this set the hypothesis failed, and I report that as loudly as the findings that held. A benchmark that only reports what it hoped to see isn't a benchmark.\n\n##  Statistics, honestly\n\nFriedman test across the four tools, over task blocks, within each regime (df=3; critical values: 0.05 → 7.82, 0.01 → 11.34):\n\n\n\n    SIMPLE:\n      accuracy  n=20  χ²= 0.40  (n.s.)    — tie\n      cost      n=20  χ²=18.42  (p<.01)   — serena < codegraph < graphlens < filesystem\n\n    HARD:\n      accuracy  n= 6  χ²= 3.50  (n.s.)    — underpowered\n      cost      n= 6  χ²=11.80  (p<.01)   — graphlens < codegraph < serena < filesystem\n\n\nWhat's honest to claim from this:\n\n  * The **cost difference is significant in both regimes** (p<.01). On HARD, graphlens is reliably the cheapest and grep reliably the most expensive. That's a solid result.\n  * The **accuracy gap on HARD is large but not statistically significant** at n=6 (χ²=3.50). It's a strong _descriptive_ signal, not a proven one. 
Six tasks is few.\n  * To firm up the accuracy claim you'd **add hard tasks, not cut simple ones.** Trimming the simple regime gives the hard one zero extra power — it just throws away good data.\n\n\n\nI'm leaving this in the article on purpose. The temptation to write \"graphlens/codegraph are more accurate than grep, proven\" is real, but n=6 doesn't carry it, and pretending otherwise would be dishonest.\n\n##  Index amortization: different currencies\n\nThe structural tools build an index once — **pure static work, zero LLM tokens** , wall-clock only:\n\nTool | one-time index\n---|---\nfilesystem | 0s\ncodegraph | 48s\ngraphlens | 84s\nserena | 94s\n\ngrep pays nothing up front but pays more per query. These are **different currencies** (seconds vs $/tokens), so I draw no single \"break-even point\" — that'd be a stretch. The picture is simple: the index is a one-time time cost with not a single token spent, while the $/task savings drip on every task. Over a long session the structural tools amortize; on a couple of one-off queries, grep's zero setup can win on time-to-first-answer.\n\n##  Takeaways for the business case\n\nBack to the original question: what do you feed the agent on a large project?\n\n**There is no \"this tool is always best\" answer.** There's a \"depends on what work you hand it\" answer:\n\n  * **One-off pinpoint lookups** (\"where is this class defined,\" \"what does it inherit from\"): use anything. grep keeps up, accuracy is the same, zero setup. You pay at most a small token overhead.\n  * **Sustained blast-radius work** — refactoring, impact analysis, disambiguation on a large base: structural tools cut cost **6–24×** and latency **6–18×** vs grep — and, just as important, **they don't hit the turn cap.** grep on these tasks isn't just expensive; 17% of the time it never reaches an answer at all.\n  * **Model choice interacts with tool choice.** A verbose graph is cheap on a small model and pricey on a big one. Running opus? 
Pick a tool with pinpoint output (codegraph, serena). Running haiku? graphlens is suddenly the cheapest.\n  * **The cheapest combo isn't \"expensive model + simple tool\" — it's \"cheap model + structural tool.\"**\n\n\n\nAnd the honest caveats, without which you can't transfer the conclusions to your project:\n\n> One repository (`apache/superset` @ 6.0.0), one harness, 26 tasks (20 simple / 6 hard). Regimes are reported separately and **never blended**. `cost_usd` is an API-equivalent, not a subscription bill. Failure = accuracy 0. This is **not a universal ranking** — it's a reproducible measurement on one concrete case.\n\n##  Where graphlens fits\n\nSince this is a follow-up to the graphlens post, let me say it straight. This benchmark does **not** prove graphlens is \"the best.\" It shows the **specific regime where its structural graph pays off** (impact analysis, cheap and fast on cheaper models), and just as plainly shows **where it lags** (on opus its verbose output costs more than codegraph and serena; codegraph is more accurate on hard tasks).\n\nFor me that's more useful than any victory lap. graphlens was built as an **engine and a precise polyglot graph model** , not a turnkey app — and the benchmark confirms exactly that: on structural questions the graph beats text search by a wide margin, and there's clear room to grow — MCP tool granularity (fewer round-trips, like codegraph) and output compactness (so it doesn't bankrupt itself on expensive models). That's my next work item, now backed by numbers instead of intuition.\n\n##  Reproduce it\n\nThe whole harness and the raw data are open. 
A run reassembles deterministically from `data/`.\n\n  * **Benchmark repo:** https://github.com/Neko1313/agent-context-bench\n  * See `metrics.ipynb` (all charts and per-section stats) and `README.md` (methodology).\n  * `uv run main.py` runs the full pipeline (clone superset → build indices → 936 runs, resumable within subscription limits), then open `metrics.ipynb`.\n\n\n\nIf you've got a large project of your own and the itch to run the harness on it — issues and results welcome. The more independent runs across different codebases, the closer we get to an answer that transfers, rather than \"works on superset.\"\n\n_(More on the tool itself: graphlens · docs.)_",
  "title": "How much does context cost an AI coding agent? grep vs graph vs LSP, measured across 936 runs"
}