Numerical instability when finetuning deberta-v3-small
Hugging Face Forums [Unofficial]
March 20, 2026
I’m trying to reproduce the results from the SANDWiCH word sense disambiguation paper. To do this I’m fine tuning a DebertaV2ForSequenceClassification model with microsoft/deberta-v3-small as the base model, and the same training parameters as given in the paper.
However, I keep seeing numerical stability issues. As the charts below show, at some point in the training there is a huge spike in the loss, after which the gradient norm becomes NaN. What can I do to diagnose and resolve this?
Discussion in the ATmosphere