
Perplexity vs BPB giving opposite rankings across tokenizers: how to evaluate

Hugging Face Forums [Unofficial] March 19, 2026

Hi, I have been running small-scale LM training experiments and came across a result I am trying to interpret more carefully.

I trained two transformer LMs with different tokenizers and architectures and observed a consistent divergence:

So depending on whether evaluation is token-based (perplexity) or byte-based (bits per byte, BPB), the relative ranking of the two models flips.
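To make the flip concrete, here is a minimal sketch of how the two metrics relate when both models are scored on the same text. All numbers are hypothetical and chosen only for illustration; they are not the results from these experiments. The function names are likewise just illustrative. The sketch assumes the summed negative log-likelihood is measured in nats and that both models see exactly the same evaluation bytes.

```python
import math

def perplexity(total_nll_nats: float, n_tokens: int) -> float:
    """Token-level perplexity: exp of the mean NLL per token."""
    return math.exp(total_nll_nats / n_tokens)

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Total NLL converted from nats to bits, divided by UTF-8 byte count."""
    return total_nll_nats / (n_bytes * math.log(2))

# Hypothetical evaluation text of 1,000,000 UTF-8 bytes.
n_bytes = 1_000_000

# Hypothetical model A: coarse tokenizer (fewer, longer tokens).
tokens_a, mean_loss_a = 200_000, 3.6   # nats per token
# Hypothetical model B: fine tokenizer (more, shorter tokens).
tokens_b, mean_loss_b = 400_000, 2.0   # nats per token

nll_a = mean_loss_a * tokens_a
nll_b = mean_loss_b * tokens_b

ppl_a, ppl_b = perplexity(nll_a, tokens_a), perplexity(nll_b, tokens_b)
bpb_a, bpb_b = bits_per_byte(nll_a, n_bytes), bits_per_byte(nll_b, n_bytes)

print(f"A: ppl={ppl_a:.2f}  bpb={bpb_a:.3f}")
print(f"B: ppl={ppl_b:.2f}  bpb={bpb_b:.3f}")
```

In this made-up case model B has the lower perplexity while model A has the lower BPB. Perplexity divides the total loss by a token count that differs between tokenizers, whereas BPB divides it by a byte count that is the same for both models.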

My current understanding is that this could be due to:

However, I am not fully confident whether this behavior is expected under proper controls, or whether my setup is introducing unintended confounds (e.g., architecture differences, context length).

I would really appreciate input on:

Whether this divergence is expected when tokenizers differ
What the correct way to compare such models is in practice
Whether BPB is generally preferred for cross-tokenizer evaluation, or whether comparisons should be normalized in some other way