{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicfxdssdbplbypkidzzfl76oyrpzen5u44iyjjmnfxru42beyk2ke",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mi2aximtiu42"
  },
  "path": "/t/inverse-correlation-during-in-training-evaluation-low-token-accuracy-and-high-ifeval-accuracy-with-reversed-results-in-post-training-evaluation/174664#post_2",
  "publishedAt": "2026-03-27T12:16:01.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "arXiv",
    "GitHub"
  ],
  "textContent": "I ran some tests on Colab. I wasn’t able to reproduce the phenomenon entirely, but I was able to reproduce it partially.\n\n* * *\n\nWhat you are seeing is coherent. It is not a paradox. It is a case where **the training-time metric and the evaluation-time metric are measuring different abilities** , and replay is changing the balance between them. TRL’s `SFTTrainer` is still optimizing next-token prediction, and for prompt-completion data it computes loss on the completion tokens only by default. IFEval strict, by contrast, scores whether a freely generated answer satisfies **all** verifiable instructions in a prompt, over a 541-prompt benchmark with 25 instruction types. Those two objectives can disagree sharply. (Hugging Face)\n\n## The background\n\n### What `mean_token_accuracy` is really measuring\n\nIn TRL SFT, `mean_token_accuracy` is a **teacher-forced top-1 token match** on labeled tokens. In plain terms, the model sees the gold prefix and is rewarded for predicting the exact next target token. For prompt-completion datasets, the trainer computes loss on the completion only unless you disable `completion_only_loss`. That makes the metric mainly a measure of **reference imitation** , not of free-generation behavior. (Hugging Face)\n\n### What IFEval strict is really measuring\n\nIFEval strict is a **prompt-level all-constraints-pass** metric. A prompt is counted as correct only if every verifiable instruction is followed. The benchmark was built around things like required keywords, exact bullet counts, formatting, case changes, punctuation, start/end constraints, and length constraints. The paper also defines a **loose** version because strict scoring is brittle to harmless details such as markdown markers, a leading line like “Sure, here it is:”, or a trailing line like “Hope it helps.” (arXiv)\n\nThat already explains most of your first phenomenon. 
A model can be worse at reproducing one labeled completion token-by-token, yet better at satisfying the benchmark’s explicit output constraints. (Hugging Face)\n\n## Why lower token accuracy can come with higher IFEval\n\nThe key is **one reference vs many valid outputs**.\n\nTeacher-forced token accuracy assumes there is one preferred continuation. IFEval does not. If the prompt says “write exactly four bullet points” or “end with a postscript starting with P.S.” then there are many valid answers. A model can generate a very different answer from the reference and still satisfy the prompt perfectly. In that case, token accuracy falls while IFEval rises. That is exactly the kind of separation the benchmark was designed to expose. (arXiv)\n\nYour replay source makes this even more likely. The SmolTalk dataset card says the `smol-constraints` subset trains models to follow explicit constraints such as fixed numbers of sentences or words and required words in the output, and that it was decontaminated against IFEval. That means your replay is not generic replay. It is replay with training signal that overlaps strongly with the _type of behavior_ IFEval rewards. (Hugging Face)\n\nSo a result like “5% replay has worse token accuracy but better IFEval” is not odd. A small amount of replay can push the model away from the exact reference completions while improving structural obedience to prompts. (Hugging Face)\n\n## Why in-training and post-training IFEval can disagree\n\nThere are two broad reasons.\n\n### 1. Real learning-dynamics differences\n\nReplay changes the effective training objective. With no replay, the adapter can fit the new dataset more aggressively, but it can also forget prior instruction-following habits more aggressively. With more replay, the adapter may preserve those habits better, but adapt more slowly to the new distribution. That means one replay ratio can look best early, while another looks best at the end. 
Your pattern of 5% looking stronger in one phase and 50% winning the final evaluation fits that logic. (Hugging Face)\n\n### 2. Evaluation-pipeline differences\n\nThis is the part I would be most careful about. Public TRL issue reports show that users trying to compute generation-based metrics inside `SFTTrainer` ran into the fact that `compute_metrics` receives logits rather than generations, and that `predict_with_generate` did not behave the way they expected in the SFT path. Another issue shows a masking pitfall: when using completion-only collation, labels can still be present in `input_ids`, which matters if you try to do generation-based evaluation through the trainer. So “in-training IFEval” and “external post-training IFEval” may not actually be the same measurement path. (GitHub)\n\nIFEval is also sensitive to generation settings. Hugging Face’s generation docs note that `max_new_tokens`, EOS handling, beam settings, and sampling settings directly affect the generated output. Since IFEval strict is sensitive to output format and truncation, a small change in generation config can move the score. (Hugging Face)\n\nThere is one more practical source of drift: LoRA loading. A PEFT issue documents cases where merged and unmerged LoRA inference produced different evaluation outputs even though users expected them to be identical. So if your post-training evaluation loads adapters differently from the in-training path, that can also contribute to ranking reversals. (GitHub)\n\n## My read of your two observations\n\n### Observation 1: lower token accuracy, higher IFEval\n\nMy interpretation is:\n\n  * token accuracy is telling you how well the model imitates the training target continuation under teacher forcing,\n  * IFEval is telling you how well the model obeys free-generation output constraints,\n  * your replay data is especially good at teaching those constraints.\n\n\n\nSo the anti-correlation is plausible and, in your setup, expected. 
The replay is likely improving **constraint compliance**, not necessarily **reference imitation**. (Hugging Face)\n\n### Observation 2: in-training and post-training IFEval rank runs differently\n\nMy interpretation is:\n\n  * some of the difference is probably real, because replay changes the adaptation-vs-retention trajectory,\n  * some of the difference may be an artifact, because trainer-internal generation-style evaluation is not equivalent to external benchmark evaluation.\n\n\n\nSo I would not trust “in-training IFEval” as a model-selection metric unless it is produced by the exact same external evaluator, with the exact same generation config, on the exact same saved checkpoints. (GitHub)\n\n## What the final numbers suggest\n\nOn the full 541-prompt IFEval benchmark, the gap between `0.675` and `0.660` is only about **8 prompts** (0.015 × 541 ≈ 8), while the gap between `0.675` and `0.610` is about **35 prompts** (0.065 × 541 ≈ 35). That means your conclusion that “no replay is clearly worse” looks much firmer than your conclusion about the exact winner among 5%, 18%, and 50%. The 50% vs 5% gap may still be real, but it is small enough that pipeline details and random variance could plausibly account for it. (arXiv)\n\n## What I think is most likely true in your case\n\nThe most likely story is this:\n\n  * **0% replay** fits the new SFT data most aggressively and loses more instruction-following behavior, so it ends up worst on final IFEval. (Hugging Face)\n  * **5% replay** gives a small retention anchor, so it can look unusually strong during training while still not being the best final checkpoint. (Hugging Face)\n  * **50% replay** preserves the instruction-following behaviors that IFEval rewards more strongly, so it can finish best even if it is not the best under teacher-forced metrics. (Hugging Face)\n\n\n\nThat is why your results can look “reversed” without anything being broken.\n\n## What to run next\n\nThese are the checks most likely to resolve the puzzle cleanly.\n\n### 1. 
Re-score every saved checkpoint with one external evaluator\n\nFor all replay ratios, evaluate the same saved checkpoints using one single external script with fixed:\n\n  * tokenizer\n  * chat template\n  * decoding mode\n  * `max_new_tokens`\n  * EOS/stop handling\n  * IFEval implementation\n\n\n\nThis removes the biggest source of pipeline mismatch. (GitHub)\n\n### 2. Report strict and loose together\n\nBecause the IFEval paper explicitly created a loose criterion to handle harmless formatting artifacts, reporting only strict can hide what kind of difference you are seeing. If a replay ratio improves strict and loose together, that is much stronger evidence of real behavior change than strict alone. (arXiv)\n\n### 3. Use teacher-forced loss, not token accuracy, as the SFT-side diagnostic\n\nIf token accuracy is saturating or close to saturating, it stops being useful. Teacher-forced loss usually still contains signal after token accuracy flattens out. TRL exposes both because they are not interchangeable. (Hugging Face)\n\n### 4. Save prompt-level pass/fail vectors\n\nFor each run, store which prompts passed and failed. If replay is mainly helping on things like bullet counts, P.S. endings, exact format, or length limits, that will show up immediately. That kind of prompt-level analysis is far more informative than comparing one average number. The IFEval benchmark is built precisely around these verifiable instruction types. (arXiv)\n\n### 5. Verify your load path\n\nIf your post-training evaluation merges LoRA weights or reloads them in a different way than your training loop uses them, test the same prompt set with:\n\n  * the live training object,\n  * the reloaded adapter,\n  * the merged model.\n\n\n\nIf those outputs differ, some of your reversal is likely load-path drift rather than true model ranking. 
(GitHub)\n\n## Bottom line\n\nThe clean conclusion is:\n\n**There is no contradiction between your results.**\nThey are consistent with a model that gets better at **imitating the labeled completion** under teacher forcing when replay is low, but gets better at **following explicit output constraints** when replay is higher. Since IFEval scores the second behavior, not the first, replay can improve IFEval while hurting or barely helping token-level metrics. And since trainer-internal and external generation evaluation are not guaranteed to be equivalent, the ranking can differ between in-training and post-training evaluation. (Hugging Face)\n\nThe most defensible operational rule for your project is:\n\n**Use teacher-forced loss as a training diagnostic, but use one fixed external IFEval pipeline for model selection.** (Hugging Face)",
  "title": "Inverse correlation during in-training evaluation: low token accuracy and high IFEval accuracy with reversed results in post-training evaluation"
}