The Three Horsemen of Numerical Divergence in Hybrid Models — Part 1 Now Live
UPDATE: Part B is now live!
If Part A was about finding the “Horsemen,” Part B is about realizing our yardstick might be broken. I scaled the study to 8 prompts and shifted the focus to the production reality: BF16 Autoregressive (AR) rollouts.
The TL;DR on Part B:
The “Low-Res” Filter: Using BF16 for rollouts is essentially passing a low-resolution filter over your target. Precision lost in the early recurrent layers of the GDN cannot be fully recovered by just throwing FP32 at the training (TF) step.
Matched or Higher Precision is King: I tested the “MiniMax Ambiguity.” It turns out that upcasting the LM head only during the training step (TF) is significantly less effective than upcasting it in both the rollout and the training step. You have to bake the precision into the inference engine, not just the trainer.
The 45% Ceiling: Even with the “best” fixes, we only recover about 45% of the total KL divergence. There is no single silver bullet here; divergence is a distributed problem.
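To make the "matched precision" point concrete, here is a minimal toy sketch in pure Python. It is not the setup used in the study: the sizes, the random weights, and the helper names (`to_bf16`, `lm_head_logits`, `kl_divergence`) are all made up for illustration, and BF16 is only emulated by rounding the logits to the nearest bfloat16 value. The hidden state is shared between both sides, so the sketch isolates the LM head and says nothing about precision lost in earlier layers.

```python
import math
import random
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps only the top 16 bits of the float32 bit pattern.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) & 0xFFFFFFFF) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def lm_head_logits(hidden, weights, bf16_output):
    """Toy LM head: one dot product per vocabulary row.

    Accumulation happens in Python floats; when bf16_output is True the
    resulting logits are rounded to bfloat16, otherwise they are kept as-is
    (standing in for an "upcast" head).
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return [to_bf16(z) for z in logits] if bf16_output else logits


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def kl_divergence(logp, logq):
    """KL(p || q) in nats, computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))


# Hypothetical sizes; weights and hidden state are stored as BF16 values.
random.seed(0)
d, vocab = 64, 32
hidden = [to_bf16(random.gauss(0, 1)) for _ in range(d)]
weights = [[to_bf16(random.gauss(0, d ** -0.5)) for _ in range(d)]
           for _ in range(vocab)]

# Case 1: head upcast only in the trainer (rollout logits stay BF16).
rollout_bf16 = log_softmax(lm_head_logits(hidden, weights, bf16_output=True))
trainer_up = log_softmax(lm_head_logits(hidden, weights, bf16_output=False))
kl_trainer_only = kl_divergence(rollout_bf16, trainer_up)

# Case 2: head upcast in both the rollout and the trainer.
rollout_up = log_softmax(lm_head_logits(hidden, weights, bf16_output=False))
kl_both = kl_divergence(rollout_up, trainer_up)

print(f"KL, head upcast in trainer only: {kl_trainer_only:.3e}")
print(f"KL, head upcast in both:         {kl_both:.3e}")
```

In this toy, upcasting only the trainer's head leaves a small but nonzero KL gap at the head, because the rollout sampled from the coarser BF16 logits. Upcasting on both sides removes that head-level gap entirely. Real models have many more sources of divergence than this one layer, which is consistent with no single fix closing the whole gap.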
I’ve updated the original post with the Heatmap and the Sequence Length scaling charts. Check out the “Part B” section for the full breakdown of why your RL rewards might be flatlining due to “blurry” targets.
Still fighting for A100s, but the data doesn’t lie!