Prompt Caching in Claude Code: Prefix Matching Architecture and Cost Implications

Hugging Face Forums [Unofficial] May 6, 2026

Abstract

Claude Code implements prompt caching through a prefix-matching architecture that reuses KV cache entries across API calls. This post examines the system design decisions behind this approach: why prefix matching was chosen over content-addressed caching, how lazy cache loading interacts with streaming, the Cache-Safe Forking pattern for parallel workloads, and the immutability constraint that makes the whole system tractable. Understanding these decisions helps practitioners structure prompts to maximize cache hit rates. Because cache reads are billed at roughly 10% of the standard input price, input cost on cached tokens can drop by up to 90%.


Architecture Overview

Anthropic’s prompt caching operates at the KV (key-value) cache layer of the transformer. When a prompt is processed, the attention keys and values computed for each token are stored. On subsequent requests, if the leading tokens of the new prompt match a cached prefix exactly, those KV entries are reused rather than recomputed.

The critical design choice here is prefix matching rather than content-addressed or segment-based caching. This means:

  • Cache hits require the prompt to start with the same token sequence as a previously cached prompt

  • A match at position N requires all positions 0 through N-1 to also match

  • Any modification to the prefix — even a single token — invalidates the cache for all subsequent positions

This is a deliberate tradeoff. Prefix matching is cheap to verify (for example, a single lookup keyed on a hash of the prefix up to the breakpoint, rather than a search over stored segments), its cost grows linearly with context length, and it maps naturally onto the autoregressive structure of transformer inference. Content-addressed caching of arbitrary segments would require indexing every contiguous subsequence, and the number of those grows quadratically with context length.
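
The matching rule in the bullets above can be sketched in a few lines. This is an illustration of the rule only, not the server-side implementation; the function name and the token IDs are made up for the example.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Return how many leading tokens of `incoming` match `cached`."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token IDs, for illustration only.
cached_prompt = [101, 7, 7, 42, 9, 13]
same_prefix = [101, 7, 7, 42, 9, 13, 55, 56]   # appends new tokens after the prefix
edited_early = [101, 7, 8, 42, 9, 13, 55, 56]  # one token changed at index 2

print(shared_prefix_len(cached_prompt, same_prefix))   # 6: whole cached prefix is reusable
print(shared_prefix_len(cached_prompt, edited_early))  # 2: everything from index 2 on is recomputed
```

The second call shows why an early edit is expensive: the tokens after index 2 are unchanged, but they no longer count as part of a matching prefix.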


Cache Entry Lifecycle

Cache entries are not created eagerly. The system uses lazy cache loading: a cache entry is written only when a request explicitly marks a position with a cache_control breakpoint of type ephemeral.

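# Anthropic SDK: marking a cache breakpoint
system_context = "..."  # placeholder: large, stable context (instructions, docs, code)
user_query = "..."      # placeholder: the per-request question

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_context,          # large, stable context
                "cache_control": {"type": "ephemeral"}  # mark for caching
            },
            {
                "type": "text",
                "text": user_query               # variable, not cached
            }
        ]
    }
]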

The ephemeral type signals that the prefix up to this point should be cached. The cache entry is written on the first request that includes this breakpoint, and subsequent requests that share the same prefix up to that point will hit the cache.

Cache entries have a default TTL of approximately 5 minutes of inactivity. Each cache read refreshes the timer, so an entry that is in steady use stays alive. The window is short enough to prevent stale entries from accumulating, but long enough to cover typical interactive session patterns where a user sends several messages within a few minutes.
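
To make the "inactivity" part of the TTL concrete, here is a toy model of the lifecycle described above. It is a sketch under stated assumptions, not Anthropic's implementation: the class name, the hash-keyed dictionary, and the injected `now` clock are all hypothetical, and the 300-second default simply mirrors the approximate 5-minute figure.

```python
import hashlib
from typing import Dict


class ToyPrefixCache:
    """Toy model of an inactivity-based TTL; not Anthropic's implementation."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl = ttl_seconds
        self._last_used: Dict[str, float] = {}  # prefix hash -> last access time

    @staticmethod
    def _key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    def request(self, prefix: str, now: float) -> str:
        """Return 'read' on a live hit, 'write' when the entry is absent or expired."""
        key = self._key(prefix)
        last = self._last_used.get(key)
        hit = last is not None and (now - last) <= self.ttl
        self._last_used[key] = now  # both writes and reads restart the inactivity timer
        return "read" if hit else "write"


cache = ToyPrefixCache()
events = [cache.request("stable system context", t) for t in (0, 120, 400, 800)]
print(events)  # ['write', 'read', 'read', 'write']
```

The request at t=400 arrives more than five minutes after the original write but still hits, because the read at t=120 restarted the timer. The request at t=800 follows a 400-second gap and pays for a fresh write.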


The Immutable Prompt Principle

The prefix matching constraint implies a design rule that practitioners often discover the hard way: cached content must be immutable from the model’s perspective.

If you include a timestamp, a random seed, or any session-specific identifier in the cached portion of your prompt, the cache will never hit. The prefix changes on every request.

This has architectural implications for how system prompts should be structured:

[CACHED REGION — stable across requests]
- Tool definitions (these come first in the prefix the API assembles)
- System instructions
- Large document context (codebase, knowledge base)
- Few-shot examples

[UNCACHED REGION — variable per request]
- Current date/time if needed
- Session-specific state
- User message

The boundary between cached and uncached regions should be placed at the last stable token before the first variable token. In practice, this means moving all dynamic content to the end of the prompt.
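
One way to check the layout rule is to fingerprint the cached region and confirm it stays the same across requests. The sketch below assumes content blocks shaped like the earlier example; `build_content` and `cached_region_fingerprint` are hypothetical helpers written for this illustration, not part of any SDK.

```python
import hashlib
import json
from typing import Any, Dict, List


def build_content(stable_context: str, dynamic_note: str, user_query: str) -> List[Dict[str, Any]]:
    """Stable text first (with the breakpoint), per-request text after it."""
    return [
        {"type": "text", "text": stable_context, "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": f"{dynamic_note}\n\n{user_query}"},
    ]


def cached_region_fingerprint(content: List[Dict[str, Any]]) -> str:
    """Hash every block up to and including the last cache_control breakpoint."""
    last = max(i for i, block in enumerate(content) if "cache_control" in block)
    blob = json.dumps(content[: last + 1], sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


stable = "You are a code assistant.\n<repo summary goes here>"
first = build_content(stable, "Current time: 2026-05-06T09:00Z", "List the entry points.")
second = build_content(stable, "Current time: 2026-05-06T09:02Z", "Now explain main().")

# The timestamps differ, but they sit after the breakpoint, so the cached region is unchanged.
print(cached_region_fingerprint(first) == cached_region_fingerprint(second))  # True
```

A local fingerprint is only a proxy, since the real comparison happens on tokens server-side, but a changing fingerprint is a cheap early warning that something dynamic has leaked into the cached region.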


Cache-Safe Forking

A common pattern in agentic workloads is parallel tool execution: the model decides to call multiple tools simultaneously, and the orchestrator fans out those calls. Naive implementations break caching here because each parallel branch may construct a different prompt.

Cache-Safe Forking addresses this by ensuring that all parallel branches share an identical prefix up to the fork point. The fork point is the last cache_control breakpoint before the branches diverge.

Shared prefix (cached):
  [tool definitions] [system prompt] [conversation history] <cache_control>

Branch A:                    Branch B:
  [tool_result_A]              [tool_result_B]
  [next_user_message]          [next_user_message]

Both branches hit the same cache entry for the shared prefix. Only the branch-specific content (tool results) is processed fresh. This is particularly valuable in Claude Code’s agentic loop, where the model frequently reads multiple files or runs multiple shell commands in parallel.
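
A minimal sketch of the forking step, assuming message dicts loosely shaped like the Messages API. `fork_branches` is a hypothetical orchestrator helper and the placeholder strings stand in for real content.

```python
import copy
from typing import Any, Dict, List

Message = Dict[str, Any]


def fork_branches(shared_prefix: List[Message],
                  branch_tails: Dict[str, List[Message]]) -> Dict[str, List[Message]]:
    """Give every branch its own copy of the same prefix, then append its tail."""
    return {
        name: copy.deepcopy(shared_prefix) + tail
        for name, tail in branch_tails.items()
    }


shared = [
    {"role": "user", "content": [{"type": "text", "text": "<request so far>"}]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "<plan>", "cache_control": {"type": "ephemeral"}},
    ]},
]
branches = fork_branches(shared, {
    "A": [{"role": "user", "content": [{"type": "text", "text": "<tool_result_A>"}]}],
    "B": [{"role": "user", "content": [{"type": "text", "text": "<tool_result_B>"}]}],
})

fork_point = len(shared)
print(branches["A"][:fork_point] == branches["B"][:fork_point])  # True: identical prefix
print(branches["A"][fork_point:] == branches["B"][fork_point:])  # False: tails diverge
```

The deep copy keeps a branch from mutating the shared prefix by accident, which would silently break the match for its siblings. If no earlier request has written the shared prefix yet, concurrent branches may each pay for a write, so sending one request ahead of the fan-out is a reasonable precaution.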


Failure Modes

Several patterns reliably prevent cache hits:

1. Prefix mutation. Any change to the cached region, including whitespace normalization, encoding differences, or template variable substitution, produces a cache miss. The comparison is at the token level, so even changes that appear semantically equivalent will miss if they tokenize differently.

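2. TTL expiry between requests. With the default 5-minute TTL, low-frequency workflows (batch jobs, overnight runs) will rarely benefit from caching, because the entry expires before the next request arrives. The default is optimized for interactive and near-real-time use cases. Anthropic's API documentation also describes an optional 1-hour TTL at a higher write price, which may suit slower workloads.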

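3. Breakpoint placement after variable content. If a cache_control breakpoint is placed after a region that changes between requests (e.g., after a one-off user message rather than before it), the cache entry will be written but not reused, since the prefix up to that point differs on the next request. An append-only conversation is the exception: earlier turns remain a stable prefix, so a breakpoint at the end of the history can still be hit by the following turn.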

4. Model version changes. Cache entries are not portable across model versions. A cache entry written against claude-sonnet-4-5 will not be reused by claude-sonnet-4-6. This matters for deployments that pin model versions and then upgrade.
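
These failure modes are easiest to spot from the token accounting on each response. The sketch below classifies a usage payload using the `cache_creation_input_tokens` and `cache_read_input_tokens` fields, as I understand the Messages API to report them. The `classify_cache_usage` helper and the sample numbers are hypothetical.

```python
from typing import Mapping, Optional


def classify_cache_usage(usage: Mapping[str, Optional[int]]) -> str:
    """Label a response's usage block as 'hit', 'write', 'partial' or 'none'."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    if read and written:
        return "partial"  # part of the prefix was read, a new segment was written
    if read:
        return "hit"
    if written:
        return "write"    # miss on a marked prefix: entry created or re-created
    return "none"         # no breakpoint, or nothing was cached


# Hypothetical usage payloads from three consecutive requests.
samples = [
    {"input_tokens": 40, "cache_creation_input_tokens": 90000, "cache_read_input_tokens": 0},
    {"input_tokens": 55, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 90000},
    {"input_tokens": 61, "cache_creation_input_tokens": 90000, "cache_read_input_tokens": 0},
]
print([classify_cache_usage(u) for u in samples])  # ['write', 'hit', 'write']
```

A second "write" for what should be the same prefix, as in the third sample, is the signal to look for one of the four causes above: a mutated prefix, an expired entry, a misplaced breakpoint, or a changed model version.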


Cost and Latency Implications

Anthropic’s published pricing shows cache read tokens cost approximately 10% of the standard input token price, and cache write tokens cost approximately 125% of the standard input token price (the write overhead is amortized over subsequent reads).

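The break-even point for a single cache entry, counting the initial write plus the reads that follow it, is the smallest request count n for which caching is cheaper than sending the prefix uncached every time:

1.25 + 0.10 × (n - 1) < n  =>  n > 1.15 / 0.90 ≈ 1.28  =>  requests_to_break_even = 2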

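Any prefix that is read at least once after the initial write, within its TTL window, is net-positive on cost. For a typical Claude Code session where the same codebase context is sent with every message, a 100K-token prefix across a 10-message session costs about 215K input-token equivalents with caching (one write at 1.25×, nine reads at 0.10×) versus 1,000K without, a reduction of roughly 78% on that prefix, assuming the entry never expires mid-session.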
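
The arithmetic can be checked with a short calculator. This is a sketch that uses only the approximate multipliers quoted above (125% for writes, 10% for reads). It measures cost in standard-input-token units rather than currency and assumes a single write followed by reads with no TTL expiry. The function names are hypothetical.

```python
WRITE_PCT = 125  # cache write price as a percentage of standard input price
READ_PCT = 10    # cache read price as a percentage of standard input price


def session_cost_units(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Cost of the shared prefix over a session, in standard-input-token units."""
    if requests <= 0:
        return 0.0
    if not cached:
        return float(prefix_tokens * requests)
    # One write on the first request, reads afterwards (assumes no TTL expiry).
    return prefix_tokens * (WRITE_PCT + READ_PCT * (requests - 1)) / 100


def break_even_requests() -> int:
    """Smallest n with WRITE_PCT + READ_PCT * (n - 1) < 100 * n."""
    return (WRITE_PCT - READ_PCT) // (100 - READ_PCT) + 1


uncached = session_cost_units(100_000, 10, cached=False)
cached = session_cost_units(100_000, 10, cached=True)
print(uncached, cached, round(1 - cached / uncached, 3), break_even_requests())
# 1000000.0 215000.0 0.785 2
```

With a single request the cached path is the more expensive one (1.25 units against 1.0), which is why a prefix that is written but never read back costs more than not caching at all.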

Latency benefits are harder to quantify precisely because they depend on server-side cache locality, but Anthropic’s documentation notes that cache hits reduce time-to-first-token for the cached portion.


Design Decisions Worth Noting

Why not content-addressed caching? Segment-level content addressing would allow caching arbitrary substrings regardless of position. The reason prefix matching was chosen is likely a combination of implementation simplicity, predictable memory layout (cache entries are contiguous prefix slices), and alignment with how transformers actually compute attention — the KV cache for position N depends on all positions 0 through N-1, so you cannot reuse position N’s cache entry without also having positions 0 through N-1.

Why ephemeral rather than persistent? The ephemeral type (as opposed to a hypothetical persistent type) reflects the reality that most prompts contain at least some session-specific content. Requiring explicit opt-in via cache_control prevents accidental caching of sensitive or variable content.

Why a 5-minute TTL? This appears to be tuned for interactive use. It is long enough to cover a typical back-and-forth conversation (where messages arrive every 30–120 seconds) but short enough to prevent cache entries from accumulating indefinitely on the server side.


Quick Reference

Parameter                          Value
Cache mechanism                    Prefix matching on KV cache
Breakpoint type                    cache_control: {type: "ephemeral"}
Cache TTL                          ~5 minutes of inactivity
Cache read cost                    ~10% of standard input token price
Cache write cost                   ~125% of standard input token price
Break-even reuse count             2 requests
Cross-model portability            No
Maximum breakpoints per request    4
