{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidqqpkbs4kbqn2zider5kubk5qopprcccukhmgjw4lwyocwmn26ji",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3moia4v6drw32"
  },
  "path": "/t/persistent-0-prompt-cache-hits-on-gpt-5-5-with-auckland-nz-cloudflare-520s-complicating-every-workaround/1383838#post_11",
  "publishedAt": "2026-06-17T10:21:01.000Z",
  "site": "https://community.openai.com",
  "tags": [
    "@dataclass"
  ],
  "textContent": "Checklist - things that are likely non-cache inputs:\n\nCalls to different models\nCalls with different service tier\nCalls with different prompt cache key\nCalls past expiry (5-60 minutes)\nCalls with framework injections of text such as UUIDs\nPrompt IDs with variables, varying prompt ID versions\nNot passing and maintaining a full chat history\nVarying or dropping encrypted reasoning, phase in output being returned\nResponses with any kind of compaction\n\nPossible: different localization routing, different organization or project, etc. OpenAI running different determinism fingerprint models on varying hardware vs 24hr retrieval, etc.\n\nThen the big one: Your actual API call, instructions + input is simply non-varying, only adding new inputs to a record of 100% fidelity.\n\nJust from and for inspiration, I asked my AI pal starting with C for some tooling, a start of inspecting past string sequences you are sending in logs or “live”. Then far more “demo” presentation than needed for a token encoder + integer list matcher when you run.\n\n\n    \"\"\"\n    token_cache_diff.py\n    -------------------\n    Compares two tiktoken-encoded integer sequences to find where their shared\n    prefix ends, and reports whether that prefix qualifies for OpenAI's prompt\n    caching discount (≥ 1024 tokens, counted in 128-token increments).\n\n    Typical use: encode your prompt at each API call and pass both encoded\n    sequences here to pinpoint where early content mutations break cache\n    eligibility between runs.\n    \"\"\"\n\n    from __future__ import annotations\n\n    import random\n    from dataclasses import dataclass\n    from typing import Optional\n\n    try:\n        import tiktoken  # only needed for the encode helper\n    except ImportError:\n        tiktoken = None  # type: ignore\n\n\n    # ─────────────────────────────────────────────────────────────────────────────\n    # Result container\n    # 
─────────────────────────────────────────────────────────────────────────────\n\n    @dataclass(frozen=True)\n    class TokenDiffResult:\n        \"\"\"Outcome of comparing two token-integer sequences.\"\"\"\n\n        matching_prefix_len: int\n        \"\"\"Number of tokens identical from index 0 up to (not including) the first break.\"\"\"\n\n        divergence_index: Optional[int]\n        \"\"\"Index of the first mismatched token, or None when one sequence is a\n        clean prefix of the other (identical, extension, or truncation).\"\"\"\n\n        divergence_type: str\n        \"\"\"\n        'identical'  – sequences are byte-for-byte the same.\n        'extension'  – candidate grew beyond reference with no mutations.\n        'truncation' – candidate is shorter than reference with no mutations.\n        'mutation'   – a token value differs at divergence_index.\n        \"\"\"\n\n        divergent_tokens: Optional[tuple[int, int]]\n        \"\"\"(reference_token, candidate_token) at the divergence point, or None.\"\"\"\n\n        cache_eligible_len: int\n        \"\"\"Largest prefix length that qualifies for a caching discount.\n        0 if the matching prefix is under the minimum threshold.\"\"\"\n\n        cache_tiers_hit: int\n        \"\"\"How many 128-token cache tiers are covered by cache_eligible_len.\"\"\"\n\n\n    # ─────────────────────────────────────────────────────────────────────────────\n    # Core comparison\n    # ─────────────────────────────────────────────────────────────────────────────\n\n    def compare_token_sequences(\n        reference: list[int],\n        candidate: list[int],\n        cache_min_tokens: int = 1024,\n        cache_increment: int = 128,\n    ) -> TokenDiffResult:\n        \"\"\"\n        Compare two tiktoken integer sequences and report where they first diverge.\n\n        A shared prefix is valid only when every token from index 0 up to (but not\n        including) the first divergence is identical.  
A sequence that is strictly\n        longer with no mutations is treated as a clean extension, not a mutation.\n\n        Args:\n            reference:        The baseline / earlier token sequence.\n            candidate:        The later token sequence being compared.\n            cache_min_tokens: Minimum matching prefix for a caching discount (default 1024).\n            cache_increment:  Cache-tier size in tokens (default 128).\n\n        Returns:\n            TokenDiffResult with the matching length, divergence position/type,\n            and the largest cache-eligible prefix length.\n\n        Examples:\n            >>> compare_token_sequences([1, 2, 3], [1, 2, 3]).divergence_type\n            'identical'\n            >>> compare_token_sequences([1, 2, 3], [1, 2, 3, 4]).divergence_type\n            'extension'\n            >>> compare_token_sequences([1, 2, 3, 4], [1, 2, 3]).divergence_type\n            'truncation'\n            >>> r = compare_token_sequences([1, 2, 9, 4], [1, 2, 3, 4])\n            >>> r.divergence_index, r.matching_prefix_len\n            (2, 2)\n        \"\"\"\n        min_len = min(len(reference), len(candidate))\n\n        # Walk only the overlapping portion looking for the first mismatch.\n        divergence_index: Optional[int] = None\n        for i in range(min_len):\n            if reference[i] != candidate[i]:\n                divergence_index = i\n                break\n\n        # Matching prefix is everything before the break (or the full overlap).\n        matching_prefix_len = divergence_index if divergence_index is not None else min_len\n\n        # Classify the relationship between the two sequences.\n        if divergence_index is not None:\n            divergence_type = \"mutation\"\n        elif len(reference) == len(candidate):\n            divergence_type = \"identical\"\n        elif len(candidate) > len(reference):\n            divergence_type = \"extension\"\n        else:\n            divergence_type = 
\"truncation\"\n\n        divergent_tokens: Optional[tuple[int, int]] = None\n        if divergence_index is not None:\n            divergent_tokens = (reference[divergence_index], candidate[divergence_index])\n\n        # Largest prefix that lands on a cache-tier boundary.\n        cache_eligible_len = 0\n        cache_tiers_hit = 0\n        if matching_prefix_len >= cache_min_tokens:\n            tiers = (matching_prefix_len - cache_min_tokens) // cache_increment\n            cache_eligible_len = cache_min_tokens + tiers * cache_increment\n            cache_tiers_hit = tiers + 1  # the first tier counts as tier 1\n\n        return TokenDiffResult(\n            matching_prefix_len=matching_prefix_len,\n            divergence_index=divergence_index,\n            divergence_type=divergence_type,\n            divergent_tokens=divergent_tokens,\n            cache_eligible_len=cache_eligible_len,\n            cache_tiers_hit=cache_tiers_hit,\n        )\n\n\n    # ─────────────────────────────────────────────────────────────────────────────\n    # Convenience wrapper — encodes text first, then compares\n    # ─────────────────────────────────────────────────────────────────────────────\n\n    def compare_text_inputs(\n        reference_text: str,\n        candidate_text: str,\n        model: str = \"gpt-4o\",\n        cache_min_tokens: int = 1024,\n        cache_increment: int = 128,\n    ) -> TokenDiffResult:\n        \"\"\"\n        Encode both strings with tiktoken and delegate to compare_token_sequences.\n\n        Args:\n            reference_text:   The earlier / baseline prompt string.\n            candidate_text:   The later prompt string to compare.\n            model:            The OpenAI model name used to select the tokeniser.\n            cache_min_tokens: Minimum matching prefix for a caching discount.\n            cache_increment:  Cache-tier size in tokens.\n\n        Returns:\n            TokenDiffResult (same as compare_token_sequences).\n\n        
Raises:\n            ImportError: if tiktoken is not installed.\n        \"\"\"\n        if tiktoken is None:\n            raise ImportError(\"tiktoken is required: pip install tiktoken\")\n\n        enc = tiktoken.encoding_for_model(model)\n        return compare_token_sequences(\n            list(enc.encode(reference_text)),\n            list(enc.encode(candidate_text)),\n            cache_min_tokens=cache_min_tokens,\n            cache_increment=cache_increment,\n        )\n\n\n    # ─────────────────────────────────────────────────────────────────────────────\n    # Console display helpers\n    # ─────────────────────────────────────────────────────────────────────────────\n\n    _W = 72  # inner width of each box row (chars between the two ║ borders)\n\n\n    def _rule(char: str = \"─\") -> str:\n        return char * _W\n\n\n    def _header(title: str) -> str:\n        return (\n            f\"╔{_rule('═')}╗\\n\"\n            f\"║  {title:<{_W - 2}}║\\n\"\n            f\"╠{_rule('═')}╣\"\n        )\n\n\n    def _divider() -> str:\n        return f\"╠{_rule('═')}╣\"\n\n\n    def _footer() -> str:\n        return f\"╚{_rule('═')}╝\"\n\n\n    def _row(text: str = \"\") -> str:\n        return f\"║{text:<{_W}}║\"\n\n\n    def _body(lines: list[str]) -> str:\n        return \"\\n\".join(_row(line) for line in lines)\n\n\n    def _print_box(title: str, sections: list[list[str]]) -> None:\n        \"\"\"Print a box with a title bar and one or more content sections.\"\"\"\n        print(_header(title))\n        for i, section in enumerate(sections):\n            if i:\n                print(_divider())\n            print(_body(section))\n        print(_footer())\n\n\n    def _tier_bar(\n        eligible_len: int,\n        matched_len: int,\n        cache_min: int = 1024,\n        cache_inc: int = 128,\n    ) -> str:\n        \"\"\"Compact tier bar: ▓ = covered tier, ░ = reachable but not yet crossed.\"\"\"\n        if matched_len < cache_min:\n            return 
\"n/a  (below minimum threshold)\"\n        max_tiers = (matched_len - cache_min) // cache_inc + 1\n        hit_tiers = (eligible_len - cache_min) // cache_inc + 1\n        bar = \"▓\" * hit_tiers + \"░\" * (max_tiers - hit_tiers)\n        next_boundary = cache_min + hit_tiers * cache_inc\n        tokens_to_next = next_boundary - matched_len\n        suffix = f\"  (+{tokens_to_next} to tier {hit_tiers + 1})\" if tokens_to_next > 0 else \"\"\n        return f\"[{bar}]  {hit_tiers}/{max_tiers}{suffix}\"\n\n\n    def _result_rows(\n        result: TokenDiffResult,\n        ref_len: int,\n        cand_len: int,\n        cache_min: int = 1024,\n        cache_inc: int = 128,\n    ) -> list[str]:\n        \"\"\"Build the content rows for a comparison result section inside a box.\"\"\"\n        rows: list[str] = []\n        delta = cand_len - ref_len\n        sign = \"+\" if delta >= 0 else \"\"\n\n        rows.append(f\"  Reference length  : {ref_len:,} tokens\")\n        rows.append(f\"  Candidate length  : {cand_len:,} tokens  ({sign}{delta:,})\")\n        rows.append(\"\")\n\n        icon = {\n            \"identical\": \"≡\", \"extension\": \"→\", \"truncation\": \"←\", \"mutation\": \"✗\"\n        }.get(result.divergence_type, \"?\")\n        rows.append(f\"  Divergence type   : {icon}  {result.divergence_type}\")\n        rows.append(f\"  Prefix matched    : {result.matching_prefix_len:,} tokens  (raw)\")\n\n        if result.divergence_index is not None:\n            rt, ct = result.divergent_tokens  # type: ignore[misc]\n            rows.append(\n                f\"  First break       : index {result.divergence_index:,}\"\n                f\"  [ref={rt}  cand={ct}]\"\n            )\n\n        rows.append(\"\")\n\n        if result.cache_eligible_len:\n            rows.append(f\"  Cache-eligible    : {result.cache_eligible_len:,} tokens\")\n            rows.append(\n                \"  Tier bar          : \"\n                + _tier_bar(result.cache_eligible_len, 
result.matching_prefix_len,\n                            cache_min, cache_inc)\n            )\n            rows.append(\"\")\n            rows.append(f\"  ✓  Valid cache prefix — {result.cache_tiers_hit} tier(s) covered\")\n        else:\n            rows.append(f\"  Cache-eligible    : 0  (need >= {cache_min:,} matching tokens)\")\n            short_by = cache_min - result.matching_prefix_len\n            rows.append(\n                f\"  Raw prefix only   : {result.matching_prefix_len:,} tokens\"\n                + (f\"  (short by {short_by:,})\" if short_by > 0 else \"\")\n            )\n            rows.append(\"\")\n            rows.append(\"  ✗  No cache discount — prefix too short or mutated\")\n\n        return rows\n\n\n    # ─────────────────────────────────────────────────────────────────────────────\n    # Demo — simulated multi-turn chat context with caching diagnostics\n    # ─────────────────────────────────────────────────────────────────────────────\n\n    if __name__ == \"__main__\":\n        SEED      = 42\n        CACHE_MIN = 1024\n        CACHE_INC = 128\n        random.seed(SEED)\n\n        # ── Build simulated token sequences ──────────────────────────────────────\n        # Token IDs are random integers in 1..50,256, roughly the GPT-2-era r50k_base range;\n        # GPT-4o's o200k_base vocabulary is far larger, which does not matter for this demo.\n\n        BASE_LEN   = 1_500  # system prompt + previous conversation context\n        ROUND1_LEN = 210    # first new user message (Turn 1)\n        ROUND2_LEN = 195    # second new user message (Turn 2)\n\n        base   = [random.randint(1, 50_256) for _ in range(BASE_LEN)]\n        round1 = [random.randint(1, 50_256) for _ in range(ROUND1_LEN)]\n        round2 = [random.randint(1, 50_256) for _ in range(ROUND2_LEN)]\n\n        seq_r0 = base                        # 1,500 — initial cached context\n        seq_r1 = base + round1               # 1,710 — after Round 1\n        seq_r2 = base + round1 + round2      # 1,905 — after Round 2\n\n        # Branch: silently mutate one token deep inside the 
base context, then\n        # re-append the same round1 and round2 suffixes.  Total length unchanged.\n        MUTATION_IDX  = 47\n        mutated_base  = base[:]\n        mutated_base[MUTATION_IDX] = (mutated_base[MUTATION_IDX] + 999) % 50_256\n        seq_branch    = mutated_base + round1 + round2  # 1,905 — index 47 is wrong\n\n        # ── Intro ─────────────────────────────────────────────────────────────────\n        print()\n        _print_box(\n            \"  TOKEN CACHE PREFIX DIFF — MULTI-TURN DEMO\",\n            [[\n                \"  Simulated tiktoken integer sequences (no real model call needed).\",\n                \"  Each turn compares the previous full prompt against the new one,\",\n                \"  mirroring how you would call compare_token_sequences() in practice.\",\n                \"\",\n                f\"  Cache discount rule: prefix >= {CACHE_MIN:,} tokens, aligned to\",\n                f\"  {CACHE_INC}-token tiers  (1024 -> 1152 -> 1280 -> 1408 -> ...)\",\n                \"\",\n                \"  Tier bar key:  ▓ = cache tier covered   ░ = tier within reach\",\n            ]],\n        )\n        print()\n\n        # ── Turn 0: base context seeded ───────────────────────────────────────────\n        _print_box(\n            \"  TURN 0  ·  Base Context Seeded  (seed=42)\",\n            [[\n                f\"  {BASE_LEN:,} tokens generated — system prompt + prior assistant turns\",\n                \"  already present in the context window.\",\n                \"\",\n                \"  Stored as the cache reference.  
No comparison yet.\",\n            ]],\n        )\n        print()\n\n        # ── Turn 1: first user round ──────────────────────────────────────────────\n        r1 = compare_token_sequences(seq_r0, seq_r1, CACHE_MIN, CACHE_INC)\n\n        _print_box(\n            \"  TURN 1  ·  Round 1 User Input  (+210 tokens appended)\",\n            [\n                [\n                    f\"  {ROUND1_LEN} new user-message tokens appended to the base context.\",\n                    \"  ref = stored cache (Turn 0)    cand = new full prompt\",\n                ],\n                _result_rows(r1, len(seq_r0), len(seq_r1), CACHE_MIN, CACHE_INC),\n            ],\n        )\n        print()\n\n        # ── Turn 2: second user round ─────────────────────────────────────────────\n        r2 = compare_token_sequences(seq_r1, seq_r2, CACHE_MIN, CACHE_INC)\n\n        tier_delta = r2.cache_tiers_hit - r1.cache_tiers_hit\n        if tier_delta > 0:\n            tier_note = (\n                f\"  ↑  +{tier_delta} tier(s) vs Turn 1 — grew past\"\n                f\" {tier_delta} x {CACHE_INC}-token boundary(s).\"\n            )\n        else:\n            tier_note = \"  —  No new cache tier boundary crossed since Turn 1.\"\n\n        _print_box(\n            \"  TURN 2  ·  Round 2 User Input  (+195 tokens appended)\",\n            [\n                [\n                    f\"  {ROUND2_LEN} more tokens appended.  
Context keeps growing cleanly.\",\n                    \"  ref = Turn 1 full prompt       cand = Turn 2 full prompt\",\n                ],\n                _result_rows(r2, len(seq_r1), len(seq_r2), CACHE_MIN, CACHE_INC),\n                [tier_note],\n            ],\n        )\n        print()\n\n        # ── Turn 3: branch / mutation ─────────────────────────────────────────────\n        r3 = compare_token_sequences(seq_r2, seq_branch, CACHE_MIN, CACHE_INC)\n\n        _print_box(\n            \"  TURN 3  ·  Branch — Early Mutation Detected\",\n            [\n                [\n                    f\"  Token at index {MUTATION_IDX} was silently changed inside the base context.\",\n                    f\"  Total length is unchanged ({len(seq_branch):,} tokens) — mutation is subtle.\",\n                    \"  ref = Turn 2 full prompt       cand = mutated branch\",\n                ],\n                _result_rows(r3, len(seq_r2), len(seq_branch), CACHE_MIN, CACHE_INC),\n                [\n                    f\"  ⚠  Prefix breaks at index {r3.divergence_index}.\"\n                    f\"  All {r2.cache_tiers_hit} previously earned tier(s) wiped.\",\n                    \"  Server must recompute the KV-cache from scratch.\",\n                    \"\",\n                    \"  Common causes of early mutation:\",\n                    \"    · Timestamp / request-ID injected into system prompt\",\n                    \"    · Dynamic fields (username, locale) placed before static content\",\n                    \"    · Tool-call results inserted ahead of the stable context block\",\n                ],\n            ],\n        )\n        print()\n\n",
  "title": "Persistent 0% prompt cache hits on GPT-5.5 with Auckland NZ Cloudflare 520s complicating every workaround"
}