The Three Horsemen of Numerical Divergence in Hybrid Models — Part 1 Now Live

Hugging Face Forums [Unofficial] April 1, 2026

Hey everyone!

I just published Part 1 of a three-part investigation into numerical divergence in hybrid models (attention + linear RNN architectures like OLMo Hybrid 7B).

The question: Why does the popular FP32 LM head fix only partially solve the training/inference KL divergence problem in hybrid models?

The finding: At 1,000 tokens, combining FP32 GDN (Gated DeltaNet) with an FP32 LM head recovers 40% of the KL divergence relative to the BF16 baseline. The surprising part: the LM head and the GDN recurrent states contribute roughly equally and independently (23-25% each), so you need both, not just one.
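For readers who want a feel for the metric, here is a toy sketch in pure Python. It is not the measurement code from the writeup. It rounds a vector of random stand-in logits to BF16, computes the KL divergence against the unrounded reference, and shows how a "percent recovered" figure could be computed. The helper names (`to_bf16`, `kl_divergence`, `percent_reduction`) are hypothetical. The definition of "recovered" as (baseline KL - improved KL) / baseline KL is an assumption about how the percentages above are reported.

```python
import math
import random
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 bit pattern.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats between the two softmax distributions."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))


def percent_reduction(baseline_kl, improved_kl):
    """Assumed definition: share of the baseline KL that a fix removes."""
    return 100.0 * (baseline_kl - improved_kl) / baseline_kl


# Stand-in data: random logits, NOT outputs from any real model.
random.seed(0)
ref = [random.gauss(0.0, 4.0) for _ in range(1000)]
bf16 = [to_bf16(v) for v in ref]

kl_bf16 = kl_divergence(ref, bf16)  # divergence introduced by BF16 rounding
kl_same = kl_divergence(ref, ref)   # identical logits give zero divergence
print(f"KL(ref || bf16-rounded) = {kl_bf16:.3e} nats")
```

In the real experiments the two logit sets would come from the same model run under different precision settings, with the KL averaged over token positions. This sketch only shows that rounding alone produces a small but non-zero divergence.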

Coming next:

  • Part 1B: Does precision matching between autoregressive (AR) and teacher-forced (TF) passes matter? Preliminary results suggest GDN precision matching has a surprisingly large effect (26.6% KL reduction).

  • Part 2: Kernel fusion (torch.compile) as a separate divergence source

  • Part 3: vLLM Triton kernel effects

Full writeup here:

huggingface.co

The Three Horsemen of Numerical Divergence in Hybrid Models

A Blog post by Jen Wei on Hugging Face

Compute contributions welcome: I'm currently fighting Colab for a stable A100 instance.
