External Publication

The Three Horsemen of Numerical Divergence in Hybrid Models — Part 1 Now Live

Hugging Face Forums [Unofficial] April 1, 2026

Hey everyone!

I just published Part 1 of a three-part investigation into numerical divergence in hybrid models (attention + linear RNN architectures like OLMo Hybrid 7B).

The question: Why does the popular FP32 LM head fix only partially solve the training/inference KL divergence problem in hybrid models?

The finding: At 1,000 tokens, combining FP32 GDN + FP32 LM head recovers ~40% of the divergence vs the BF16 baseline. The surprising part: LM head and GDN recurrent states contribute roughly equally and independently (~23-25% each). You need both, not just one.
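
For anyone who wants to poke at the shape of this ablation without a GPU, here is a minimal toy sketch in pure Python. It is not the code behind the results above. Everything in it is an assumption made for illustration: the names (`to_bf16`, `run`, `results`) are hypothetical, a simple decaying recurrence stands in for the GDN state update, a small random matrix stands in for the LM head, and Python doubles stand in for FP32. It toggles BF16 rounding on the recurrent state and on the head separately, then measures KL against the full-precision run after 1,000 steps. The numbers it prints are not comparable to the percentages in this post.

```python
import math
import random
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the dropped 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def run(steps, state_bf16, head_bf16, seed=0):
    """Toy model: decaying recurrent state followed by a linear head."""
    rng = random.Random(seed)  # same seed, so every config sees the same data
    dim, vocab, decay = 8, 16, 0.99  # toy sizes, chosen arbitrarily
    weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
    state = [0.0] * dim
    for _ in range(steps):
        state = [decay * s + rng.gauss(0, 0.1) for s in state]
        if state_bf16:
            # Store the state in BF16 after every step, so error can compound.
            state = [to_bf16(s) for s in state]
    if head_bf16:
        # BF16 inputs and BF16 output, with higher-precision accumulation.
        logits = [
            to_bf16(sum(to_bf16(w) * to_bf16(s) for w, s in zip(row, state)))
            for row in weights
        ]
    else:
        logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return softmax(logits)


reference = run(1000, state_bf16=False, head_bf16=False)
results = {
    (state_bf16, head_bf16): kl_divergence(
        reference, run(1000, state_bf16, head_bf16)
    )
    for state_bf16 in (False, True)
    for head_bf16 in (False, True)
}

for (state_bf16, head_bf16), value in results.items():
    print(f"state_bf16={state_bf16!s:5} head_bf16={head_bf16!s:5} KL={value:.3e}")
```

The design point is that the two switches are independent: the state rounding happens inside the loop and compounds over steps, while the head rounding happens once at the end. Comparing the two single-switch rows against the both-switched row is the same kind of comparison as the one summarized above, just on a toy.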

Coming next:

Part 1B: Does precision matching between autoregressive (AR) and teacher-forced (TF) runs matter? Preliminary results suggest GDN precision matching has a surprisingly large effect (26.6% KL reduction).
Part 2: Kernel fusion (torch.compile) as a separate divergence source
Part 3: vLLM Triton kernel effects

Full writeup here:

The Three Horsemen of Numerical Divergence in Hybrid Models, a blog post by Jen Wei on Hugging Face (huggingface.co)

Compute contributions welcome: I'm currently fighting Colab for a stable A100 instance.
