
Inverse correlation during in-training evaluation: low token accuracy and high IFEval accuracy with reversed results in post-training evaluation

Hugging Face Forums [Unofficial] March 26, 2026

I have a bit of a puzzle here and would be happy to hear from the knowledgeable people around here. I am examining the effect of training a LoRA adapter with different amounts of added replay buffer, and I run both in-training and post-training evaluations using the IFEval strict prompt evaluation. I understand that the generation flow is completely different between in-training and post-training evaluation, but the results are still not what I expected. Explanations, together with experiments I can run to verify them, would be highly appreciated.

I’m using SFTConfig and SFTTrainer from the trl package for all configuration and experiments. Overall I ran 4 training runs with 0%, 5%, 18% and 50% replay buffer (an X% addition to the original dataset in terms of training examples). The replay buffer was taken from the SMOL2 dataset.

I see two phenomena that puzzle me:

1. Low token accuracy is correlated with higher IFEval accuracy. I anticipated that lower token accuracy would mean worse generation quality and more instruction-following errors. In reality, the 5% buffer run has significantly lower token accuracy but higher instruction-following accuracy.
2. There is no correlation between the in-training and post-training IFEval accuracy. In the post-training IFEval evaluation, the 50% buffer run reached the best performance (0.675), the 5% and 18% runs scored lower (0.66), and the no-buffer run scored lowest (0.61). The post-training results make sense, but they do not track the in-training accuracy: the no-buffer run is lowest on all tests, while the top-performing variant differs between in-training and post-training evaluation.

Adding the W&B monitoring for reference.
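To make the two quantities in the post concrete, here is a minimal sketch in plain Python. It is not the poster's code: `mix_with_replay` and `token_accuracy` are hypothetical helper names, the mixing assumes "X% replay buffer" means adding X% of the original example count sampled without replacement, and the accuracy function assumes token accuracy is the teacher-forced next-token match rate over positions whose label is not masked with -100.

```python
import random

IGNORE_INDEX = -100  # assumption: label value used to mask positions out of the loss


def mix_with_replay(original, replay_pool, pct, seed=0):
    """Return the original examples plus pct% extra examples drawn from replay_pool.

    Hypothetical helper: pct is relative to len(original), matching the
    "X% addition to the original dataset" description in the post.
    """
    n_extra = round(len(original) * pct / 100)
    if n_extra > len(replay_pool):
        raise ValueError("replay pool is smaller than the requested buffer")
    rng = random.Random(seed)
    # Sample without replacement, then shuffle so replay examples are interleaved.
    mixed = list(original) + rng.sample(list(replay_pool), n_extra)
    rng.shuffle(mixed)
    return mixed


def token_accuracy(predicted_ids, label_ids):
    """Fraction of non-masked positions where the predicted token equals the label.

    Hypothetical helper: predicted_ids are assumed to be per-position argmax
    predictions made with the ground-truth prefix as context (teacher forcing).
    """
    pairs = [(p, y) for p, y in zip(predicted_ids, label_ids) if y != IGNORE_INDEX]
    if not pairs:
        return 0.0
    return sum(p == y for p, y in pairs) / len(pairs)


if __name__ == "__main__":
    original = list(range(200))          # stand-in for the task dataset
    pool = list(range(1000, 2000))       # stand-in for the replay source
    for pct in (0, 5, 18, 50):
        print(pct, len(mix_with_replay(original, pool, pct)))
    print(token_accuracy([1, 2, 3, 4], [1, 2, 9, IGNORE_INDEX]))
```

Under these assumptions, token accuracy is averaged over every unmasked token in the evaluation set, replay examples included, and each prediction is conditioned on the reference prefix. IFEval strict scoring instead checks whole free-running generations against their instructions, so the two metrics need not move together.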
