{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigfuqx7tqqlnr6m3lzvid34v45p47aitidifac3sktvcvuf54yrrq",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhy5sw7rsqv2"
},
"path": "/t/inverse-correlation-during-in-training-evaluation-low-token-accuracy-and-high-ifeval-accuracy-with-reversed-results-in-post-training-evaluation/174664#post_1",
"publishedAt": "2026-03-26T15:53:44.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "I have a bit of a puzzle here and would be happy to hear from the knowledgeable people around here.\n\nI am examining the effect of training a LoRA adapter with different percentages of additional replay-buffer data. I evaluate both during training and after training using the IFEval strict prompt-level accuracy. I understand that the generation flow is completely different between in-training and post-training evaluation, but I still expected different behavior. Explanations, ideally with experiments I can run to verify them, would be highly appreciated.\n\nI’m using SFTConfig and SFTTrainer from the trl package for all configuration and experiments. In total I ran 4 training runs with 0%, 5%, 18% and 50% replay buffer (X% means X% more training examples added on top of the original dataset). The replay buffer was taken from the SMOL2 dataset.\n\nI see two phenomena that puzzle me:\n\n 1. Low token accuracy is correlated with higher IFEval accuracy. I anticipated that lower token accuracy would mean worse generation quality and more instruction-following errors. In reality, the 5% buffer run has significantly lower token accuracy but higher instruction-following accuracy.\n 2. There is no correlation between the in-training and post-training IFEval accuracy. In the post-training IFEval evaluation, the 50% buffer run reached the best performance (0.675), the 5% and 18% runs scored lower (0.66), and the no-buffer run scored lowest (0.61). The post-training results make sense on their own, but they do not track the in-training accuracy: the no-buffer run is lowest on all tests, yet the top-performing variant differs between in-training and post-training evaluation.\n\nAdding the W&B monitoring for reference.",
"title": "Inverse correlation during in-training evaluation: low token accuracy and high IFEval accuracy with reversed results in post-training evaluation"
}