{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigz3hfcum7f5gkpcwo54ggnx473wyvcughgpcboc33u46umzvysjy",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhgqadkbypr2"
},
"path": "/t/perplexity-vs-bpb-giving-opposite-rankings-across-tokenizers-how-to-evaluate/174406#post_1",
"publishedAt": "2026-03-19T18:43:51.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"https://github.com/zyberg2091/LocalLLM"
],
"textContent": "Hi, I have been running small-scale LM training experiments and came across a result that I am trying to interpret more carefully.\n\nI trained two transformer LMs that differ in both tokenizer and architecture, and observed a consistent divergence:\n\n * The model with the **larger vocabulary** shows **higher perplexity**\n * But the same model achieves **lower bits-per-byte (BPB)**\n\nSo depending on whether evaluation is token-based (perplexity) or byte-based (BPB), the relative ranking of the two models flips.\n\nMy current understanding is that this could be because, with a larger vocabulary:\n\n * Each token covers more text on average and is harder to predict (hurting per-token perplexity)\n * But the same text is encoded in fewer tokens (improving BPB)\n\nHowever, I am not fully confident whether:\n\n * This behavior is expected under proper controls\n * Or my setup is introducing unintended confounds (e.g., architecture differences or context length)\n\nI would really appreciate input on:\n\n * Whether this divergence is expected when tokenizers differ\n * What the correct way to compare such models is in practice\n * Whether BPB is generally preferred for cross-tokenizer evaluation, or whether comparisons should be normalized differently\n\nRepo (for context):\nhttps://github.com/zyberg2091/LocalLLM",
"title": "Perplexity vs BPB giving opposite rankings across tokenizers: how to evaluate"
}