The Three Horsemen of Numerical Divergence in Hybrid Models — Part 1 Now Live
Hey everyone!
I just published Part 1 of a three-part investigation into numerical divergence in hybrid models (attention + linear RNN architectures like OLMo Hybrid 7B).
The question: Why does the popular FP32 LM head fix only partially solve the training/inference KL divergence problem in hybrid models?
The finding: At 1,000 tokens, combining an FP32 GDN (Gated DeltaNet) with an FP32 LM head recovers 40% of the divergence relative to the BF16 baseline. The surprising part: the LM head and the GDN recurrent states contribute roughly equally and independently (23-25% each). You need both, not just one.
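For anyone who wants a feel for why LM head precision shows up as KL divergence at all, here is a minimal toy sketch. It is not the measurement code from the writeup. It assumes random weights, tiny made-up sizes (`HIDDEN`, `VOCAB`), a pure-Python `to_bf16` helper that simulates bfloat16 rounding, and Python floats standing in for the higher-precision head. It only rounds the output logits to BF16 and compares the resulting next-token distribution against the unrounded one.

```python
import math
import random
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # bfloat16 is the top 16 bits of float32; the bias turns truncation into rounding.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


random.seed(0)
HIDDEN, VOCAB = 64, 512  # hypothetical toy sizes, far smaller than a real model

# Hidden state and LM head weights, both stored as BF16-representable values.
hidden = [to_bf16(random.gauss(0.0, 1.0)) for _ in range(HIDDEN)]
weights = [
    [to_bf16(random.gauss(0.0, 3.0 / math.sqrt(HIDDEN))) for _ in range(HIDDEN)]
    for _ in range(VOCAB)
]

# Reference head: logits kept at full Python float precision (stand-in for FP32).
logits_ref = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
# BF16 head: same logits, but the output is rounded to bfloat16.
logits_bf16 = [to_bf16(v) for v in logits_ref]

kl = kl_divergence(softmax(logits_ref), softmax(logits_bf16))
print(f"KL(reference || BF16 logits) = {kl:.3e}")
```

The per-token KL from this toy is small, and it says nothing about the recurrent-state contribution, which compounds over the sequence and needs the real model to measure.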
Coming next:
Part 1B: Does precision matching between the autoregressive (AR) and teacher-forced (TF) passes matter? Preliminary results suggest GDN precision matching has a surprisingly large effect (26.6% KL reduction).
Part 2: Kernel fusion (torch.compile) as a separate divergence source
Part 3: vLLM Triton kernel effects
Full writeup here:
The Three Horsemen of Numerical Divergence in Hybrid Models, a blog post by Jen Wei on Hugging Face (huggingface.co)
Compute contributions welcome: I'm currently fighting Colab for a stable A100 instance.