Perplexity vs BPB giving opposite rankings across tokenizers: how to evaluate
Hi, I have been running small-scale LM training experiments and came across a result I am trying to interpret more carefully.
I trained two transformer LMs with different tokenizers and architectures and observed a consistent divergence:
The model with a larger vocabulary shows higher perplexity
But the same model achieves lower bits-per-byte (BPB)
So depending on whether evaluation is token-based (perplexity) or byte-based (BPB), the relative ranking of the models flips.
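For reference, this is how I understand the two metrics to be related, assuming both are computed from the same summed negative log-likelihood over the same evaluation text. The symbols are my own notation: $\mathrm{NLL}$ is the total loss in nats, $N_{\text{tok}}$ the number of tokens the tokenizer produces, and $N_{\text{bytes}}$ the number of UTF-8 bytes in the raw text.

```latex
\mathrm{PPL} = \exp\!\left(\frac{\mathrm{NLL}}{N_{\text{tok}}}\right),
\qquad
\mathrm{BPB} = \frac{\mathrm{NLL}}{N_{\text{bytes}} \,\ln 2}
             = \frac{N_{\text{tok}}}{N_{\text{bytes}}}\,\log_2 \mathrm{PPL}
```

If this is right, BPB is the log of perplexity rescaled by tokens per byte, so a tokenizer that emits fewer tokens per byte can have higher perplexity and still lower BPB.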
My current understanding is that this could be due to:
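Tokens from a larger vocabulary each covering more text on average, so each one is harder to predict (hurting per-token perplexity)
But the same text being encoded in fewer tokens, so the total loss over the same bytes can still be lower (improving BPB)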
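To check that this explanation is at least arithmetically possible, here is a minimal sketch with made-up numbers (not my actual results). The token counts, per-token losses, and the helper names `perplexity` and `bits_per_byte` are all hypothetical and only illustrate how the two metrics can rank the same pair of models in opposite order.

```python
import math


def perplexity(total_nll_nats: float, n_tokens: int) -> float:
    """Per-token perplexity from the summed NLL (in nats)."""
    return math.exp(total_nll_nats / n_tokens)


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Bits per byte: summed NLL converted to bits, divided by raw byte count."""
    return total_nll_nats / (math.log(2) * n_bytes)


# Hypothetical evaluation text of 1,000,000 UTF-8 bytes.
n_bytes = 1_000_000

# Hypothetical model A: smaller vocabulary, ~4 bytes per token.
tokens_a = 250_000
nll_a = 2.9 * tokens_a  # assumed mean loss of 2.9 nats per token

# Hypothetical model B: larger vocabulary, ~5 bytes per token.
tokens_b = 200_000
nll_b = 3.4 * tokens_b  # assumed mean loss of 3.4 nats per token

ppl_a, ppl_b = perplexity(nll_a, tokens_a), perplexity(nll_b, tokens_b)
bpb_a, bpb_b = bits_per_byte(nll_a, n_bytes), bits_per_byte(nll_b, n_bytes)

print(f"A: ppl={ppl_a:.2f}  bpb={bpb_a:.3f}")
print(f"B: ppl={ppl_b:.2f}  bpb={bpb_b:.3f}")
```

With these assumed numbers, model B has the higher perplexity but the lower BPB, because its total loss over the same bytes is smaller even though each individual token is harder to predict.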
However, I am not fully confident whether:
This behavior is expected under proper controls
Or if my setup is introducing unintended confounds (e.g., architecture differences, context length, etc.)
I would really appreciate input on:
Whether this divergence is expected when tokenizers differ
What the correct way to compare such models is in practice
Whether BPB is generally preferred in cross-tokenizer evaluation or if comparisons should be normalized differently
Repo (for context): https://github.com/zyberg2091/LocalLLM