{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigz3hfcum7f5gkpcwo54ggnx473wyvcughgpcboc33u46umzvysjy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhgwxlksihn2"
  },
  "path": "/t/perplexity-vs-bpb-giving-opposite-rankings-across-tokenizers-how-to-evaluate/174406#post_1",
  "publishedAt": "2026-03-19T18:43:51.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://github.com/zyberg2091/LocalLLM"
  ],
  "textContent": "Hi, I have been running small-scale LM training experiments and came across a result I am trying to interpret more carefully.\n\nI trained two transformer LMs with different tokenizers and architectures and observed a consistent divergence:\n\n  * The model with a **larger vocabulary** shows **higher perplexity**\n\n  * But the same model achieves **lower bits-per-byte (BPB)**\n\n\n\n\nSo depending on whether evaluation is token-based (perplexity) or byte-based (BPB), the relative ranking of the models flips.\n\nMy current understanding is that this could be due to:\n\n  * Larger tokens being harder to predict (hurting perplexity)\n\n  * But more efficient in representing raw text (improving BPB)\n\n\n\n\nHowever, I am not fully confident whether:\n\n  * This behavior is expected under proper controls\n\n  * Or if my setup is introducing unintended confounds (e.g., architecture differences, context length, etc.)\n\n\n\n\nI would really appreciate input on:\n\n  * Whether this divergence is expected when tokenizers differ\n\n  * What the correct way to compare such models is in practice\n\n  * Whether BPB is generally preferred in cross-tokenizer evaluation or if comparisons should be normalized differently\n\n\n\n\nRepo (for context):\nhttps://github.com/zyberg2091/LocalLLM",
  "title": "Perplexity vs BPB giving opposite rankings across tokenizers : how to evaluate"
}