{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiexkhqyhyu7shdmiscohvfnhphxqpw4uafieyd5kvhbohqmwieylm",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3miicxkazna52"
},
"path": "/t/the-three-horsemen-of-numerical-divergence-in-hybrid-models-part-1-now-live/174870#post_1",
"publishedAt": "2026-04-01T23:57:39.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"huggingface.co",
"The Three Horsemen of Numerical Divergence in Hybrid Models"
],
"textContent": "Hey everyone!\n\nI just published Part 1 of a three-part investigation into numerical divergence in hybrid models (attention + linear RNN architectures like OLMo Hybrid 7B).\n\n**The question:** Why does the popular FP32 LM head fix only partially solve the training/inference KL divergence problem in hybrid models?\n\n**The finding:** At 1,000 tokens, combining FP32 GDN with an FP32 LM head removes ~40% of the divergence relative to the BF16 baseline. The surprising part: the LM head and the GDN recurrent states contribute roughly equally and independently (~23-25% each). You need both, not just one.\n\n**Coming next:**\n\n * Part 1B: Does precision _matching_ between AR (autoregressive) and TF (teacher-forced) runs matter? Preliminary results suggest GDN precision matching has a surprisingly large effect (26.6% KL reduction).\n\n * Part 2: Kernel fusion (torch.compile) as a separate divergence source\n\n * Part 3: vLLM Triton kernel effects\n\nFull writeup here: The Three Horsemen of Numerical Divergence in Hybrid Models, a blog post by Jen Wei on Hugging Face (huggingface.co).\n\nCompute contributions welcome. I'm currently fighting Colab for a stable A100 instance.",
"title": "The Three Horsemen of Numerical Divergence in Hybrid Models — Part 1 Now Live"
}