
Numerical instability when finetuning deberta-v3-small

Hugging Face Forums [Unofficial] March 20, 2026

I’m trying to reproduce the results from the SANDWiCH word sense disambiguation paper. To do this I’m fine-tuning a DebertaV2ForSequenceClassification model with microsoft/deberta-v3-small as the base model, using the same training parameters as given in the paper. However, I keep running into numerical stability issues. As the charts below show, at some point during training there is a huge spike in the loss, after which the gradient norm becomes NaN. What can I do to diagnose and resolve this?
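As a first diagnostic step, it may help to pin down exactly which step the loss spike and the first non-finite value occur at. Below is a minimal sketch using only the Python standard library. It assumes the per-step loss and gradient norm have been copied out of the training log into two lists. The function name `find_instability` and the `window` and `spike_factor` defaults are hypothetical choices for illustration, not anything taken from the paper or from a library.

```python
import math
from statistics import median


def find_instability(losses, grad_norms, window=20, spike_factor=5.0):
    """Return (first_spike_step, first_nonfinite_step); either may be None.

    losses and grad_norms are per-step values taken from the training log.
    A "spike" is a loss larger than spike_factor times the median of the
    previous `window` finite losses (both thresholds are illustrative).
    """
    first_spike = None
    first_nonfinite = None
    for step, (loss, norm) in enumerate(zip(losses, grad_norms)):
        # Record the first step where either value is NaN or infinite.
        if first_nonfinite is None and not (
            math.isfinite(loss) and math.isfinite(norm)
        ):
            first_nonfinite = step
        # Compare the current loss against a rolling median baseline.
        if first_spike is None and step >= window and math.isfinite(loss):
            recent = [x for x in losses[step - window:step] if math.isfinite(x)]
            if recent and loss > spike_factor * median(recent):
                first_spike = step
    return first_spike, first_nonfinite


# Synthetic example: a flat loss, one spike, then NaNs.
losses = [1.0] * 30 + [50.0] + [float("nan")] * 3
grad_norms = [0.5] * 31 + [float("nan")] * 3
print(find_instability(losses, grad_norms))
```

Knowing the step makes it possible to look at the batches just before the spike, or to rerun from a nearby checkpoint with different settings (for example full fp32 precision or a lower learning rate) to see whether the spike still appears. Those are general things to try, not a confirmed cause for this particular run.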
