{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiexkhqyhyu7shdmiscohvfnhphxqpw4uafieyd5kvhbohqmwieylm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mihvifdz4k72"
  },
  "path": "/t/the-three-horsemen-of-numerical-divergence-in-hybrid-models-part-1-now-live/174870#post_1",
  "publishedAt": "2026-04-01T23:57:39.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "huggingface.co",
    "The Three Horsemen of Numerical Divergence in Hybrid Models"
  ],
  "textContent": "Hey everyone!\n\nI just published Part 1 of a three-part investigation into numerical divergence in hybrid models (attention + linear RNN architectures like OLMo Hybrid 7B).\n\n**The question:** Why does the popular FP32 LM head fix only partially solve the training/inference KL divergence problem in hybrid models?\n\n**The finding:** At 1,000 tokens, combining FP32 GDN + FP32 LM head recovers ~40% of the divergence vs BF16 baseline. The surprising part — LM head and GDN recurrent states contribute roughly equally and independently (~23-25% each). You need both, not just one.\n\n**Coming next:**\n\n  * Part 1B: Does precision _matching_ between AR and TF matter? Preliminary results suggest GDN precision matching has a surprisingly large effect (26.6% KL reduction).\n\n  * Part 2: Kernel fusion (torch.compile) as a separate divergence source\n\n  * Part 3: vLLM Triton kernel effects\n\n\n\n\nFull writeup here:\n\nhuggingface.co\n\n### The Three Horsemen of Numerical Divergence in Hybrid Models\n\nA Blog post by Jen Wei on Hugging Face\n\nCompute contributions welcome  — currently fighting CoLab for a stable A100 instance.",
  "title": "The Three Horsemen of Numerical Divergence in Hybrid Models — Part 1 Now Live"
}