{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreidqqpkbs4kbqn2zider5kubk5qopprcccukhmgjw4lwyocwmn26ji",
"uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3moia4v6drw32"
},
"path": "/t/persistent-0-prompt-cache-hits-on-gpt-5-5-with-auckland-nz-cloudflare-520s-complicating-every-workaround/1383838#post_11",
"publishedAt": "2026-06-17T10:21:01.000Z",
"site": "https://community.openai.com",
"tags": [
"@dataclass"
],
"textContent": "Checklist - things likely to make an input miss the cache:\n\nCalls to different models\nCalls with different service tier\nCalls with different prompt cache key\nCalls past expiry (5-60 minutes)\nCalls with framework injections of text such as UUIDs\nPrompt IDs with variables, varying prompt ID versions\nNot passing and maintaining a full chat history\nVarying or dropping encrypted reasoning, phase in output being returned\nResponses with any kind of compaction\n\nPossible: different localization routing, different organization or project, etc. OpenAI running different determinism fingerprint models on varying hardware vs 24hr retrieval, etc.\n\nThen the big one: make sure your actual API call (instructions + input) is simply non-varying, only appending new inputs to a record kept with 100% fidelity.\n\nJust for inspiration, I asked my AI pal starting with C for some tooling: a start at inspecting the string sequences you are sending, from logs or “live”. What came back has far more “demo” presentation than a token encoder + integer list matcher needs when you run it.\n\n\n \"\"\"\n token_cache_diff.py\n -------------------\n Compares two tiktoken-encoded integer sequences to find where their shared\n prefix ends, and reports whether that prefix qualifies for OpenAI's prompt\n caching discount (≥ 1024 tokens, counted in 128-token increments).\n\n Typical use: encode your prompt at each API call and pass both encoded\n sequences here to pinpoint where early content mutations break cache\n eligibility between runs.\n \"\"\"\n\n from __future__ import annotations\n\n import random\n from dataclasses import dataclass\n from typing import Optional\n\n try:\n import tiktoken # only needed for the encode helper\n except ImportError:\n tiktoken = None # type: ignore\n\n\n # ─────────────────────────────────────────────────────────────────────────────\n # Result container\n # ─────────────────────────────────────────────────────────────────────────────\n\n @dataclass(frozen=True)\n class 
TokenDiffResult:\n \"\"\"Outcome of comparing two token-integer sequences.\"\"\"\n\n matching_prefix_len: int\n \"\"\"Number of tokens identical from index 0 up to (not including) the first break.\"\"\"\n\n divergence_index: Optional[int]\n \"\"\"Index of the first mismatched token, or None when one sequence is a\n clean prefix of the other (identical, extension, or truncation).\"\"\"\n\n divergence_type: str\n \"\"\"\n 'identical' – sequences are byte-for-byte the same.\n 'extension' – candidate grew beyond reference with no mutations.\n 'truncation' – candidate is shorter than reference with no mutations.\n 'mutation' – a token value differs at divergence_index.\n \"\"\"\n\n divergent_tokens: Optional[tuple[int, int]]\n \"\"\"(reference_token, candidate_token) at the divergence point, or None.\"\"\"\n\n cache_eligible_len: int\n \"\"\"Largest prefix length that qualifies for a caching discount.\n 0 if the matching prefix is under the minimum threshold.\"\"\"\n\n cache_tiers_hit: int\n \"\"\"How many 128-token cache tiers are covered by cache_eligible_len.\"\"\"\n\n\n # ─────────────────────────────────────────────────────────────────────────────\n # Core comparison\n # ─────────────────────────────────────────────────────────────────────────────\n\n def compare_token_sequences(\n reference: list[int],\n candidate: list[int],\n cache_min_tokens: int = 1024,\n cache_increment: int = 128,\n ) -> TokenDiffResult:\n \"\"\"\n Compare two tiktoken integer sequences and report where they first diverge.\n\n A shared prefix is valid only when every token from index 0 up to (but not\n including) the first divergence is identical. 
A sequence that is strictly\n longer with no mutations is treated as a clean extension, not a mutation.\n\n Args:\n reference: The baseline / earlier token sequence.\n candidate: The later token sequence being compared.\n cache_min_tokens: Minimum matching prefix for a caching discount (default 1024).\n cache_increment: Cache-tier size in tokens (default 128).\n\n Returns:\n TokenDiffResult with the matching length, divergence position/type,\n and the largest cache-eligible prefix length.\n\n Examples:\n >>> compare_token_sequences([1, 2, 3], [1, 2, 3]).divergence_type\n 'identical'\n >>> compare_token_sequences([1, 2, 3], [1, 2, 3, 4]).divergence_type\n 'extension'\n >>> compare_token_sequences([1, 2, 3, 4], [1, 2, 3]).divergence_type\n 'truncation'\n >>> r = compare_token_sequences([1, 2, 9, 4], [1, 2, 3, 4])\n >>> r.divergence_index, r.matching_prefix_len\n (2, 2)\n \"\"\"\n min_len = min(len(reference), len(candidate))\n\n # Walk only the overlapping portion looking for the first mismatch.\n divergence_index: Optional[int] = None\n for i in range(min_len):\n if reference[i] != candidate[i]:\n divergence_index = i\n break\n\n # Matching prefix is everything before the break (or the full overlap).\n matching_prefix_len = divergence_index if divergence_index is not None else min_len\n\n # Classify the relationship between the two sequences.\n if divergence_index is not None:\n divergence_type = \"mutation\"\n elif len(reference) == len(candidate):\n divergence_type = \"identical\"\n elif len(candidate) > len(reference):\n divergence_type = \"extension\"\n else:\n divergence_type = \"truncation\"\n\n divergent_tokens: Optional[tuple[int, int]] = None\n if divergence_index is not None:\n divergent_tokens = (reference[divergence_index], candidate[divergence_index])\n\n # Largest prefix that lands on a cache-tier boundary.\n cache_eligible_len = 0\n cache_tiers_hit = 0\n if matching_prefix_len >= cache_min_tokens:\n tiers = (matching_prefix_len - cache_min_tokens) // 
cache_increment\n cache_eligible_len = cache_min_tokens + tiers * cache_increment\n cache_tiers_hit = tiers + 1 # the first tier counts as tier 1\n\n return TokenDiffResult(\n matching_prefix_len=matching_prefix_len,\n divergence_index=divergence_index,\n divergence_type=divergence_type,\n divergent_tokens=divergent_tokens,\n cache_eligible_len=cache_eligible_len,\n cache_tiers_hit=cache_tiers_hit,\n )\n\n\n # ─────────────────────────────────────────────────────────────────────────────\n # Convenience wrapper — encodes text first, then compares\n # ─────────────────────────────────────────────────────────────────────────────\n\n def compare_text_inputs(\n reference_text: str,\n candidate_text: str,\n model: str = \"gpt-4o\",\n cache_min_tokens: int = 1024,\n cache_increment: int = 128,\n ) -> TokenDiffResult:\n \"\"\"\n Encode both strings with tiktoken and delegate to compare_token_sequences.\n\n Args:\n reference_text: The earlier / baseline prompt string.\n candidate_text: The later prompt string to compare.\n model: The OpenAI model name used to select the tokeniser.\n cache_min_tokens: Minimum matching prefix for a caching discount.\n cache_increment: Cache-tier size in tokens.\n\n Returns:\n TokenDiffResult (same as compare_token_sequences).\n\n Raises:\n ImportError: if tiktoken is not installed.\n \"\"\"\n if tiktoken is None:\n raise ImportError(\"tiktoken is required: pip install tiktoken\")\n\n enc = tiktoken.encoding_for_model(model)\n return compare_token_sequences(\n list(enc.encode(reference_text)),\n list(enc.encode(candidate_text)),\n cache_min_tokens=cache_min_tokens,\n cache_increment=cache_increment,\n )\n\n\n # ─────────────────────────────────────────────────────────────────────────────\n # Console display helpers\n # ─────────────────────────────────────────────────────────────────────────────\n\n _W = 72 # inner width of each box row (chars between the two ║ borders)\n\n\n def _rule(char: str = \"─\") -> str:\n return char * _W\n\n\n def 
_header(title: str) -> str:\n return (\n f\"╔{_rule('═')}╗\\n\"\n f\"║ {title:<{_W - 1}}║\\n\"\n f\"╠{_rule('═')}╣\"\n )\n\n\n def _divider() -> str:\n return f\"╠{_rule('═')}╣\"\n\n\n def _footer() -> str:\n return f\"╚{_rule('═')}╝\"\n\n\n def _row(text: str = \"\") -> str:\n return f\"║{text:<{_W}}║\"\n\n\n def _body(lines: list[str]) -> str:\n return \"\\n\".join(_row(line) for line in lines)\n\n\n def _print_box(title: str, sections: list[list[str]]) -> None:\n \"\"\"Print a box with a title bar and one or more content sections.\"\"\"\n print(_header(title))\n for i, section in enumerate(sections):\n if i:\n print(_divider())\n print(_body(section))\n print(_footer())\n\n\n def _tier_bar(\n eligible_len: int,\n matched_len: int,\n cache_min: int = 1024,\n cache_inc: int = 128,\n ) -> str:\n \"\"\"Compact tier bar: ▓ = covered tier, ░ = reachable but not yet crossed.\"\"\"\n if matched_len < cache_min:\n return \"n/a (below minimum threshold)\"\n max_tiers = (matched_len - cache_min) // cache_inc + 1\n hit_tiers = (eligible_len - cache_min) // cache_inc + 1\n bar = \"▓\" * hit_tiers + \"░\" * (max_tiers - hit_tiers)\n next_boundary = cache_min + hit_tiers * cache_inc\n tokens_to_next = next_boundary - matched_len\n suffix = f\" (+{tokens_to_next} to tier {hit_tiers + 1})\" if tokens_to_next > 0 else \"\"\n return f\"[{bar}] {hit_tiers}/{max_tiers}{suffix}\"\n\n\n def _result_rows(\n result: TokenDiffResult,\n ref_len: int,\n cand_len: int,\n cache_min: int = 1024,\n cache_inc: int = 128,\n ) -> list[str]:\n \"\"\"Build the content rows for a comparison result section inside a box.\"\"\"\n rows: list[str] = []\n delta = cand_len - ref_len\n sign = \"+\" if delta >= 0 else \"\"\n\n rows.append(f\" Reference length : {ref_len:,} tokens\")\n rows.append(f\" Candidate length : {cand_len:,} tokens ({sign}{delta:,})\")\n rows.append(\"\")\n\n icon = {\n \"identical\": \"≡\", \"extension\": \"→\", \"truncation\": \"←\", \"mutation\": \"✗\"\n 
}.get(result.divergence_type, \"?\")\n rows.append(f\" Divergence type : {icon} {result.divergence_type}\")\n rows.append(f\" Prefix matched : {result.matching_prefix_len:,} tokens (raw)\")\n\n if result.divergence_index is not None:\n rt, ct = result.divergent_tokens # type: ignore[misc]\n rows.append(\n f\" First break : index {result.divergence_index:,}\"\n f\" [ref={rt} cand={ct}]\"\n )\n\n rows.append(\"\")\n\n if result.cache_eligible_len:\n rows.append(f\" Cache-eligible : {result.cache_eligible_len:,} tokens\")\n rows.append(\n \" Tier bar : \"\n + _tier_bar(result.cache_eligible_len, result.matching_prefix_len,\n cache_min, cache_inc)\n )\n rows.append(\"\")\n rows.append(f\" ✓ Valid cache prefix — {result.cache_tiers_hit} tier(s) covered\")\n else:\n rows.append(f\" Cache-eligible : 0 (need >= {cache_min:,} matching tokens)\")\n short_by = cache_min - result.matching_prefix_len\n rows.append(\n f\" Raw prefix only : {result.matching_prefix_len:,} tokens\"\n + (f\" (short by {short_by:,})\" if short_by > 0 else \"\")\n )\n rows.append(\"\")\n rows.append(\" ✗ No cache discount — prefix too short or mutated\")\n\n return rows\n\n\n # ─────────────────────────────────────────────────────────────────────────────\n # Demo — simulated multi-turn chat context with caching diagnostics\n # ─────────────────────────────────────────────────────────────────────────────\n\n if __name__ == \"__main__\":\n SEED = 42\n CACHE_MIN = 1024\n CACHE_INC = 128\n random.seed(SEED)\n\n # ── Build simulated token sequences ──────────────────────────────────────\n # Token IDs are random integers in 1..50,256, a GPT-2-sized vocabulary.\n # GPT-4o's o200k_base IDs run to roughly 200k, but the range does not\n # affect the prefix comparison.\n\n BASE_LEN = 1_500 # system prompt + previous conversation context\n ROUND1_LEN = 210 # first new user message (Turn 1)\n ROUND2_LEN = 195 # second new user message (Turn 2)\n\n base = [random.randint(1, 50_256) for _ in range(BASE_LEN)]\n round1 = [random.randint(1, 50_256) for _ in range(ROUND1_LEN)]\n round2 = [random.randint(1, 50_256) for _ in 
range(ROUND2_LEN)]\n\n seq_r0 = base # 1,500 — initial cached context\n seq_r1 = base + round1 # 1,710 — after Round 1\n seq_r2 = base + round1 + round2 # 1,905 — after Round 2\n\n # Branch: silently mutate one token deep inside the base context, then\n # re-append the same round1 and round2 suffixes. Total length unchanged.\n MUTATION_IDX = 47\n mutated_base = base[:]\n mutated_base[MUTATION_IDX] = (mutated_base[MUTATION_IDX] + 999) % 50_256\n seq_branch = mutated_base + round1 + round2 # 1,905 — index 47 is wrong\n\n # ── Intro ─────────────────────────────────────────────────────────────────\n print()\n _print_box(\n \" TOKEN CACHE PREFIX DIFF — MULTI-TURN DEMO\",\n [[\n \" Simulated tiktoken integer sequences (no real model call needed).\",\n \" Each turn compares the previous full prompt against the new one,\",\n \" mirroring how you would call compare_token_sequences() in practice.\",\n \"\",\n f\" Cache discount rule: prefix >= {CACHE_MIN:,} tokens, aligned to\",\n f\" {CACHE_INC}-token tiers (1024 -> 1152 -> 1280 -> 1408 -> ...)\",\n \"\",\n \" Tier bar key: ▓ = cache tier covered ░ = tier within reach\",\n ]],\n )\n print()\n\n # ── Turn 0: base context seeded ───────────────────────────────────────────\n _print_box(\n \" TURN 0 · Base Context Seeded (seed=42)\",\n [[\n f\" {BASE_LEN:,} tokens generated — system prompt + prior assistant turns\",\n \" already present in the context window.\",\n \"\",\n \" Stored as the cache reference. 
No comparison yet.\",\n ]],\n )\n print()\n\n # ── Turn 1: first user round ──────────────────────────────────────────────\n r1 = compare_token_sequences(seq_r0, seq_r1, CACHE_MIN, CACHE_INC)\n\n _print_box(\n \" TURN 1 · Round 1 User Input (+210 tokens appended)\",\n [\n [\n f\" {ROUND1_LEN} new user-message tokens appended to the base context.\",\n \" ref = stored cache (Turn 0) cand = new full prompt\",\n ],\n _result_rows(r1, len(seq_r0), len(seq_r1), CACHE_MIN, CACHE_INC),\n ],\n )\n print()\n\n # ── Turn 2: second user round ─────────────────────────────────────────────\n r2 = compare_token_sequences(seq_r1, seq_r2, CACHE_MIN, CACHE_INC)\n\n tier_delta = r2.cache_tiers_hit - r1.cache_tiers_hit\n if tier_delta > 0:\n tier_note = (\n f\" ↑ +{tier_delta} tier(s) vs Turn 1 — grew past\"\n f\" {tier_delta} x {CACHE_INC}-token boundary(s).\"\n )\n else:\n tier_note = \" — No new cache tier boundary crossed since Turn 1.\"\n\n _print_box(\n \" TURN 2 · Round 2 User Input (+195 tokens appended)\",\n [\n [\n f\" {ROUND2_LEN} more tokens appended. 
Context keeps growing cleanly.\",\n \" ref = Turn 1 full prompt cand = Turn 2 full prompt\",\n ],\n _result_rows(r2, len(seq_r1), len(seq_r2), CACHE_MIN, CACHE_INC),\n [tier_note],\n ],\n )\n print()\n\n # ── Turn 3: branch / mutation ─────────────────────────────────────────────\n r3 = compare_token_sequences(seq_r2, seq_branch, CACHE_MIN, CACHE_INC)\n\n _print_box(\n \" TURN 3 · Branch — Early Mutation Detected\",\n [\n [\n f\" Token at index {MUTATION_IDX} was silently changed inside the base context.\",\n f\" Total length is unchanged ({len(seq_branch):,} tokens) — mutation is subtle.\",\n \" ref = Turn 2 full prompt cand = mutated branch\",\n ],\n _result_rows(r3, len(seq_r2), len(seq_branch), CACHE_MIN, CACHE_INC),\n [\n f\" ⚠ Prefix breaks at index {r3.divergence_index}.\"\n f\" All {r2.cache_tiers_hit} previously earned tier(s) wiped.\",\n \" Server must recompute the KV-cache from scratch.\",\n \"\",\n \" Common causes of early mutation:\",\n \" · Timestamp / request-ID injected into system prompt\",\n \" · Dynamic fields (username, locale) placed before static content\",\n \" · Tool-call results inserted ahead of the stable context block\",\n ],\n ],\n )\n print()\n\n",
"title": "Persistent 0% prompt cache hits on GPT-5.5 with Auckland NZ Cloudflare 520s complicating every workaround"
}