Perplexity vs BPB giving opposite rankings across tokenizers: how to evaluate

Hugging Face Forums [Unofficial] March 19, 2026
Hi, I have been running small-scale LM training experiments and came across a result I am trying to interpret more carefully.

I trained two transformer LMs with different tokenizers and architectures and observed a consistent divergence:

  • The model with a larger vocabulary shows higher perplexity

  • But the same model achieves lower bits-per-byte (BPB)

So depending on whether evaluation is token-based (perplexity) or byte-based (BPB), the relative ranking of the models flips.
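
For reference, here is how I understand the two metrics to be related. This is only a sketch of the standard definitions; the symbols are my own notation, not something taken from the repo: $N$ is the number of tokens the tokenizer produces for the evaluation text, $B$ is the number of UTF-8 bytes in that same text, and $p(x_i \mid x_{<i})$ is the model's probability for token $i$.

```latex
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln p(x_i \mid x_{<i})\right),
\qquad
\mathrm{BPB} = -\frac{1}{B \ln 2}\sum_{i=1}^{N} \ln p(x_i \mid x_{<i})
             = \frac{N}{B}\,\log_2 \mathrm{PPL}
```

Perplexity normalizes the total loss by $N$, which depends on the tokenizer, while BPB normalizes by $B$, which does not. The factor $N/B$ (tokens per byte) is what allows the two rankings to disagree.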

My current understanding is that this could be due to:

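  • Tokens from a larger vocabulary cover more bytes on average and are individually harder to predict (hurting per-token perplexity)

  • But the same text is encoded in fewer tokens, so the total loss per byte can still be lower (improving BPB)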
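
To illustrate that this flip is arithmetically possible, here is a minimal sketch with made-up numbers. The token counts and loss totals are hypothetical and are not measurements from my runs; the function names are also just for illustration. Both hypothetical models are scored on the same 1,000,000 bytes of text.

```python
import math


def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Per-token perplexity from the summed negative log-likelihood (in nats)."""
    return math.exp(total_nll_nats / num_tokens)


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Summed negative log-likelihood converted to bits, divided by raw UTF-8 bytes."""
    return total_nll_nats / (num_bytes * math.log(2))


# Hypothetical evaluation text: the same 1,000,000 bytes for both models.
num_bytes = 1_000_000

# Hypothetical small-vocabulary model: more tokens, 2.0 nats per token.
small_tokens, small_nll = 400_000, 800_000.0
# Hypothetical large-vocabulary model: fewer tokens, 2.4 nats per token.
large_tokens, large_nll = 250_000, 600_000.0

ppl_small = perplexity(small_nll, small_tokens)   # about 7.39
ppl_large = perplexity(large_nll, large_tokens)   # about 11.02
bpb_small = bits_per_byte(small_nll, num_bytes)   # about 1.154
bpb_large = bits_per_byte(large_nll, num_bytes)   # about 0.866

print(f"small vocab: PPL={ppl_small:.2f}  BPB={bpb_small:.3f}")
print(f"large vocab: PPL={ppl_large:.2f}  BPB={bpb_large:.3f}")
```

In this toy case the large-vocabulary model has the higher perplexity but the lower BPB, because its higher per-token loss is spread over fewer tokens. It says nothing about whether the gap in my actual experiments comes from the tokenizer or from the other differences between the two models.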

However, I am not fully confident whether:

  • This behavior is expected under proper controls

  • Or whether my setup is introducing unintended confounds (e.g., architecture differences or context length)

I would really appreciate input on:

  • Whether this divergence is expected when tokenizers differ

  • What the correct way to compare such models is in practice

  • Whether BPB is generally preferred in cross-tokenizer evaluation or if comparisons should be normalized differently

Repo (for context): https://github.com/zyberg2091/LocalLLM
