Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigfuqx7tqqlnr6m3lzvid34v45p47aitidifac3sktvcvuf54yrrq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhy5sw7rsqv2"
  },
  "path": "/t/inverse-correlation-during-in-training-evaluation-low-token-accuracy-and-high-ifeval-accuracy-with-reversed-results-in-post-training-evaluation/174664#post_1",
  "publishedAt": "2026-03-26T15:53:44.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I have a bit of a puzzle and would be happy to hear from the knowledgeable people here.\n\nI am examining the effect of training a LoRA adapter with different percentages of added replay buffer. I run both in-training and post-training evaluations using the IFEval strict prompt-level evaluation. I understand that the generation flow is completely different between in-training and post-training evaluation, but I still expected different behavior from what I observed. Explanations, along with experiments I can run to verify them, would be highly appreciated.\n\nI’m using SFTConfig and SFTTrainer from the trl package for all configuration and experiments. Overall I ran 4 training runs with 0%, 5%, 18% and 50% replay buffer (X% added on top of the original dataset, measured in training examples). The replay buffer was taken from the SMOL2 dataset.\n\nI see two phenomena that puzzle me:\n\n  1. Low token accuracy is correlated with higher IFEval accuracy. I anticipated that lower token accuracy would mean worse generation quality and more instruction-following errors. In reality, the 5% buffer run has significantly lower token accuracy but higher instruction-following accuracy.\n  2. There is no correlation between the in-training and post-training IFEval accuracy. In the post-training IFEval evaluation, the 50% buffer run reached the best performance (0.675), the 5% and 18% runs scored lower (0.66), and the no-buffer run scored lowest (0.61). The post-training results make sense, but they are not correlated with the in-training accuracy: the no-buffer run is lowest on all tests, while the top-performing variant is not the same in the in-training and post-training evaluations.\n\nAdding the W&B monitoring for reference.",
  "title": "Inverse correlation during in-training evaluation: low token accuracy and high IFEval accuracy with reversed results in post-training evaluation"
}