Microsoft FastContext: A Repo-Explorer Subagent Cuts Coding-Agent Tokens by up to 60% (Explorer-Subagent Context Offloading)
What: The FastContext paper (Microsoft) trains a dedicated explorer subagent — a 4B-30B model the main coding agent calls to find code — that issues read-only searches and returns compact file-line citations instead of dumping files into the main context.
Why: Reading and searching a repository is the biggest single drain on a coding agent: in GPT-5.4 traces it ate 56.2% of tool-use turns and 46.5% of the main agent's tokens, so moving that work off the main agent is where the token budget is won.
vs prior: A normal coding agent greps and reads files itself, so every raw file lands in its own context window and crowds out the actual coding. FastContext offloads exploration to a separate subagent that returns only citations: the evidence, not the haystack.
Think of it as
A reference librarian you send into the stacks.
                 ONE CODE QUESTION
                         │
           ┌─────────────┴─────────────┐
           │                           │
  ┌────────▼────────┐         ┌────────▼────────┐
  │     READ IT     │         │     SEND A      │
  │    YOURSELF     │         │    LIBRARIAN    │
  │   (baseline)    │         │  (FastContext)  │
  └────────┬────────┘         └────────┬────────┘
           │                           │
  haul every file into        explorer greps the
  your own context            stacks, hands back
                              an index card
           │                           │
           ▼                           ▼
  ✗ ~18,000 tokens            ✓ ~480 tokens of
    bury the desk               citations; desk
    before you code             stays clear
- main agent = you at a small desk, no room to pile up whole books
- explorer subagent = the librarian you send into the stacks to look
- Read / Glob / Grep = the librarian skimming many shelves in parallel
- file-line citation = an index card: shelf 3, page 88 — not the whole book
- context window = the desk; pile whole books on it and it overflows
Quick glossary
Explorer subagent — A separate model the main agent delegates a sub-task to. Here its one job is exploration: take a natural-language query, search the repo, and hand back what it found — it never writes code.
Context offloading — Keeping the bulky, raw evidence out of the main agent's context window and bringing back only a compact result. The reading still happens — just not in the context that has to do the reasoning.
Read / Glob / Grep — The three read-only tools an explorer uses: Read opens a file, Glob matches file names by pattern, Grep searches file contents. None of them change anything, so running many at once is safe.
File-line citation — A pointer of the form path/to/file.ts:88-104 — the exact place the answer lives. Returning the citation instead of the whole file is what keeps the result compact.
SFT (supervised fine-tuning) — Training a model on example (query → good exploration) pairs so it imitates them. It's the first of FastContext's two training stages.
Task-grounded RL — Reinforcement learning where the reward isn't "did the search look reasonable" but did the exploration actually help solve the downstream task. It tunes the explorer toward evidence that the main agent can act on.
Mini-SWE-Agent — A small open-source coding-agent harness. FastContext was plugged into it to measure the end-to-end effect on real software-engineering tasks.
Token budget — The total tokens an agent spends on a task — what you pay for in cost and latency. Exploration dominates it, which is why offloading it moves the number so much.
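To make the file-line citation from the glossary concrete, here is a minimal Python sketch of parsing and resolving one. The `Citation` class, `parse_citation`, `resolve`, and the regular expression are hypothetical names chosen for illustration; the only thing taken from the text is the `path/to/file.ts:88-104` shape, and FastContext's real output format may differ.

```python
import re
from dataclasses import dataclass

# Assumed shape: "<path>:<start>-<end>", e.g. "path/to/file.ts:88-104".
CITATION_RE = re.compile(r"^(?P<path>.+):(?P<start>\d+)-(?P<end>\d+)$")


@dataclass(frozen=True)
class Citation:
    path: str
    start: int  # first cited line, 1-indexed
    end: int    # last cited line, inclusive

    def __str__(self) -> str:
        return f"{self.path}:{self.start}-{self.end}"


def parse_citation(text: str) -> Citation:
    """Turn a citation string into a structured pointer."""
    match = CITATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a file-line citation: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start < 1 or end < start:
        raise ValueError(f"invalid line range in {text!r}")
    return Citation(match["path"], start, end)


def resolve(citation: Citation, file_lines: list[str]) -> list[str]:
    """Return only the cited lines from an already-loaded file."""
    return file_lines[citation.start - 1 : citation.end]
```

The point of the structure is that the main agent can hold the short string and fetch the cited lines only if and when it needs them.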
The news. On June 15, 2026, Microsoft released FastContext, a system that attacks the most expensive thing a coding agent does: finding the right code. Analyzing GPT-5.4 trajectories, the authors found reading and searching accounted for 56.2% of tool-use turns and 46.5% of the main agent's total tokens. FastContext trains dedicated 4B-30B exploration models that the main agent queries in natural language; the explorer fires read-only Read/Glob/Grep calls in parallel and returns focused file-line citations. Plugged into Mini-SWE-Agent, it reports up to +5.5% resolution rate and up to 60% fewer tokens. Weights are open on Hugging Face.
Picture yourself at a small desk in a vast library, trying to answer one question. The naive way is to walk the stacks yourself, haul every promising book back, and stack them on the desk — and within a dozen volumes the desk is buried, the early books slide onto the floor, and you can't even see the question anymore. The desk is the bottleneck, and you filled it with raw material you mostly didn't need. A coding agent does exactly this when it greps and reads files itself: every file it opens lands in its own context window, and long before it starts writing the fix, the window is full of source it skimmed once and will never look at again.
That's not a small inefficiency; it's the inefficiency. When FastContext's authors traced real GPT-5.4 coding runs, reading and searching the repository accounted for 56.2% of tool-use turns and 46.5% of the main agent's tokens. Nearly half of the main agent's token budget goes to finding code, not changing it. And exploration is the most context-poisoning kind of work there is: it pulls in big, low-signal blobs of text whose only useful output is usually a single line number.
So FastContext stops doing the exploring on the main desk. It sends a librarian into the stacks. The main agent delegates a natural-language query, such as "where is the retry budget enforced?", to a separate explorer subagent, a 4B-30B model trained for exactly this. The explorer reads, globs, and greps its way through the repo in parallel read-only calls, then hands back not an armful of files but an index card: scheduler/retry.go:88-104, the exact evidence. The main agent's desk stays clear, holding citations instead of haystacks. The reading happened, but the bulk never touched the context that has to reason. Because the explorer only ever uses read-only tools, running a swarm of those searches at once is safe by construction.
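The delegation pattern in this paragraph can be sketched in a few lines of Python. This is a toy stand-in under stated assumptions, not FastContext's implementation: the real explorer is a trained model deciding which searches to run, whereas `glob_files`, `grep_file`, and `explore` are hypothetical helpers that run one fixed regex search. What the sketch does show is the shape of the contract: read-only calls fanned out in parallel, and only citations coming back.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def glob_files(root: Path, pattern: str) -> list[Path]:
    """Glob: match file names by pattern (read-only)."""
    return sorted(p for p in root.rglob(pattern) if p.is_file())


def grep_file(path: Path, regex: re.Pattern, context: int) -> list[tuple[int, int]]:
    """Grep: return (start, end) line ranges around each matching line."""
    try:
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    except OSError:
        return []  # unreadable file: report no evidence rather than fail
    return [
        (max(1, n - context), min(len(lines), n + context))
        for n, line in enumerate(lines, start=1)
        if regex.search(line)
    ]


def explore(root: Path, pattern: str, needle: str, context: int = 2) -> list[str]:
    """Search many files in parallel; return citations, never file contents."""
    regex = re.compile(needle)
    files = glob_files(root, pattern)
    # Read-only work has no side effects, so running it concurrently is safe.
    with ThreadPoolExecutor(max_workers=8) as pool:
        ranges_per_file = list(pool.map(lambda p: grep_file(p, regex, context), files))
    citations = []
    for path, ranges in zip(files, ranges_per_file):
        rel = path.relative_to(root).as_posix()
        citations.extend(f"{rel}:{start}-{end}" for start, end in ranges)
    return citations
```

A call such as `explore(Path("."), "*.go", r"retryBudget")` would return strings shaped like `scheduler/retry.go:88-104`. The file text is read inside the helper and discarded, which is the offloading step; overlapping ranges are not merged here, to keep the sketch short.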
The explorer earns its accuracy in two training stages. First, supervised fine-tuning teaches it to imitate good exploration traces; then task-grounded RL rewards it not for searches that merely look thorough but for evidence that actually lets the main agent solve the downstream task. A scout that brings back the wrong shelf is worse than useless, so the reward is tied to the outcome, not the search.
| Who reads the repo | What lands in the main context | Cost |
|---|---|---|
| Main agent itself (baseline) | every file it opens — raw source | ~46.5% of tokens spent exploring (paper) |
| A prompted, untrained sub-call | often the whole transcript dumped back | re-floods context; little net saving (illustrative) |
| FastContext explorer subagent | compact file-line citations only | up to 60% fewer tokens, up to +5.5% resolution (paper) |
Where does a cut of up to 60% actually come from? Walk one task (token counts here are illustrative; the paper reports the percentages, not these absolute numbers). Say solving a bug needs evidence from 12 files averaging 1,500 tokens each. A baseline agent that reads them all carries 12 × 1,500 = 18,000 tokens of raw source in its working context, and that's before it writes a line. FastContext's explorer reads the same 12 files in its own scratch context, then returns 12 citations at ~40 tokens each = ~480 tokens. The main agent now reasons over ~480 tokens instead of 18,000, a ~37× lighter exploration footprint on the desk that matters. Multiply that across a long task where exploration was already 46.5% of the budget, and a headline reduction of up to 60% stops looking surprising: it's just the haystack never landing on the desk.
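The arithmetic in this walk-through is easy to check with a short Python sketch. The inputs are the same illustrative numbers used above (12 files, 1,500 tokens per file, 40 tokens per citation), not figures from the paper; `footprint` and `saved_share` are hypothetical helper names, and `saved_share` assumes a deliberately simple model in which only the exploration share of the budget shrinks.

```python
def footprint(num_files: int, tokens_per_file: int, tokens_per_citation: int) -> dict:
    """Compare raw-file tokens with citation tokens in the main context."""
    raw = num_files * tokens_per_file
    cited = num_files * tokens_per_citation
    return {"raw_tokens": raw, "citation_tokens": cited, "ratio": raw / cited}


def saved_share(exploration_share: float, ratio: float) -> float:
    """Fraction of the total budget saved if only exploration shrinks by `ratio`."""
    return exploration_share * (1 - 1 / ratio)


result = footprint(num_files=12, tokens_per_file=1500, tokens_per_citation=40)
print(result)  # {'raw_tokens': 18000, 'citation_tokens': 480, 'ratio': 37.5}

# 46.5% is the exploration share of main-agent tokens reported in the paper.
print(round(saved_share(0.465, result["ratio"]), 3))  # 0.453
```

Under these assumptions, shrinking the exploration share alone recovers roughly 45% of the budget. The reported figure of up to 60% is an end-to-end measurement, so the sketch is best read as an order-of-magnitude check rather than a reproduction of the paper's result.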
Related explainers
- SearchSwarm — distilling delegation into the weights — a different lever on the same problem: bake decomposition-and-delegation into one model's weights, rather than splitting off a separate explorer at runtime
- Grep vs vector retrieval for agentic search — what the explorer is actually doing under the hood when it greps the repo instead of embedding it
- GrepSeek — GRPO-trained shell-command search — another search agent trained with RL to use shell tools well
FAQ
What is explorer-subagent context offloading?
It's a pattern where a coding agent doesn't search the codebase itself but delegates the search to a separate "explorer" model. The explorer reads and greps files in its own context, then returns only compact pointers — file paths and line ranges — to the main agent. The bulky raw source never enters the main agent's context window, which is what frees up its budget for the actual coding. FastContext trains that explorer (SFT plus task-grounded RL) at 4B-30B scale.
Why does it cut tokens so much?
Because finding code is the dominant cost. In FastContext's analysis of GPT-5.4 traces, reading and searching was 56.2% of tool-use turns and 46.5% of the main agent's tokens. Most of that text is low-signal — its only useful output is a line number. Offloading the reading to a subagent that returns citations instead of files removes the haystack from the main context, which is where the up-to-60% token reduction comes from.
How is this different from SearchSwarm's distilled delegation?
Both reduce context pressure through delegation, but at different layers. SearchSwarm bakes task-decomposition-and-delegation into one model's weights via supervised fine-tuning, so a single model delegates by reflex. FastContext keeps two separate agents at inference time: a general main agent plus a specialized read-only explorer it calls for context. One trains the behavior into a model; the other architects it into the system.
Originally posted on Learn AI Visually.