{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreid2tdcq54xie5i2od7owicfffnyxcrjc24gayjet4ggilxbjpetny",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnhxqv52wfr2"
  },
  "path": "/t/why-does-naive-replay-still-beat-most-sophisticated-continual-learning-methods-in-practice/176513#post_2",
  "publishedAt": "2026-06-04T13:50:58.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Experience Replay for Continual Learning",
    "Dark Experience Replay",
    "The Effectiveness of Memory Replay in Large Scale Continual Learning",
    "A Comprehensive Empirical Evaluation on Online Continual Learning",
    "Spurious Forgetting in Continual Learning of Language Models",
    "How to Alleviate Catastrophic Forgetting in LLMs Fine-tuning?",
    "O-LoRA",
    "LoRA Learns Less and Forgets Less",
    "InsCL",
    "Self-Synthesized Rehearsal",
    "SuRe",
    "On-Policy Replay for Continual Supervised Fine-Tuning"
  ],
  "textContent": "Hmm… Difficult… I had an LLM reason through this fairly freely, using existing papers and related discussions as grounding:\n\n* * *\n\nI think your update-constraint intuition is reasonable. I would not read the replay results as “the clever methods are useless,” or as “naive random replay is the final answer.” My more modest read is:\n\n> Replay is hard to beat because it is a very direct behavioral anchor.\n>  EWC/Fisher-style methods protect an estimated parameter-importance proxy.\n>  O-LoRA-style methods constrain update geometry.\n>  Replay simply re-exposes the model to examples of the old/general behavior we care about preserving.\n\nSo I would frame this less as:\n\n> replay vs. EWC vs. O-LoRA\n\nand more as:\n\n> what behavioral anchor do we need, and how much destructive plasticity should we allow?\n\nUnder that framing, replay and update constraints are not really competing in the strongest sense. They address different failure modes.\n\n## TL;DR\n\nMy current read is:\n\n  * **Replay remains a strong baseline** because it directly anchors the evaluated behavior.\n  * **EWC/Fisher-style regularization can be fragile** because it protects an indirect parameter-importance estimate, not behavior itself.\n  * **O-LoRA / orthogonal subspace / freezing / LoRA rank control are still useful** , but I would view them as ways to reduce destructive plasticity, not as complete replacements for behavioral anchors.\n  * **LoRA-style methods may forget less partly because they update less aggressively** , which is useful, but it can also mean they learn less from the new domain.\n  * **For 10–20 domains, naive random replay probably becomes too clean a story.** The question becomes what to replay, when to replay, and how to consolidate.\n  * **In production, I would probably start with constrained PEFT + small curated/general/synthetic replay + regression evals** , not pure replay and not pure parameter regularization.\n  * For volatile factual 
updates, I would often avoid weight updates entirely and use RAG/tools/DB instead.\n\n\n\nNone of this is meant as a strong claim that replay is “the correct method.” I would phrase it more cautiously: replay is a surprisingly hard baseline to replace because it is close to the behavior we evaluate.\n\n* * *\n\n## 1. Why replay may be hard to beat\n\nThe simplest explanation is also the most important one:\n\n> Replay puts old/general behavior back into the loss.\n\nThat sounds almost too simple, but it matters. Many other continual-learning methods protect something indirect.\n\nMethod family | What it directly constrains | What it only indirectly protects\n---|---|---\nEWC / Fisher regularization | movement of parameters estimated to be important | old-task behavior\nO-LoRA / orthogonal subspace methods | interference between update subspaces | specific outputs, formats, capabilities\nGradient projection methods | harmful gradient directions under some criterion | full behavior under old/general prompts\nReplay | behavior on replayed examples | behavior outside replay coverage\nReplay + constrained PEFT | behavior anchor + limited plasticity | still depends on buffer/eval design\n\nThis is why I would be cautious about saying that replay is “dumb.” It is not sophisticated, but it is close to the evaluation surface.\n\nIf the benchmark asks:\n\n> Can the model still do old task X after learning new task Y?\n\nthen replay directly says:\n\n> Keep training on examples of old task X.\n\nEWC says something more indirect:\n\n> Do not move parameters that appear important for old task X.\n\nO-LoRA says:\n\n> Put new task updates in a subspace that interferes less with previous task updates.\n\nBoth are sensible ideas. But they are further away from the measured behavior.\n\nThis is also not only an LLM-specific oddity. In older continual-learning settings, replay has repeatedly survived as a strong baseline. 
For example:\n\n  * Experience Replay for Continual Learning\n  * Dark Experience Replay\n  * The Effectiveness of Memory Replay in Large Scale Continual Learning\n  * A Comprehensive Empirical Evaluation on Online Continual Learning\n\n\n\nThe LLM setting adds complications: instruction-following, output format, safety behavior, style, calibration, and task alignment. But the basic point remains similar: replay directly reintroduces examples of the behavior we want to preserve.\n\n* * *\n\n## 2. Replay may preserve more than “old knowledge”\n\nOne subtle point: in LLMs, some measured forgetting may not be literal erasure of knowledge.\n\nIt may also be:\n\n  * loss of task alignment,\n  * loss of output convention,\n  * shift in answer style,\n  * shift in refusal/safety behavior,\n  * shift in response length,\n  * shift in formatting,\n  * shift in which latent “mode” the model enters for a given prompt.\n\n\n\nThis does not make the degradation less real. From the user or benchmark perspective, the model still got worse. But it changes which mitigation methods we should expect to help.\n\nThe paper Spurious Forgetting in Continual Learning of Language Models makes a related point: performance drops in LLM continual learning can sometimes reflect declining task alignment rather than true loss of underlying knowledge.\n\nThat makes replay especially plausible as a mitigation. Replay does not only remind the model of old facts. 
It also re-exposes the model to old prompt-response interfaces.\n\nFor instruction-tuned LLMs, that can matter a lot.\n\nA small example:\n\nWhat may drift after sequential fine-tuning | Why replay can help\n---|---\nThe model starts answering everything in the new domain’s style | Replay reintroduces general-domain answer styles\nJSON/tool-call formatting degrades | Replay reintroduces format-contract examples\nSafety/refusal behavior shifts | Replay reintroduces safety boundary examples\nOld tasks are still “known” but not triggered correctly | Replay reintroduces old prompt-response patterns\nGeneral reasoning/coding/math degrades | General/capability replay reanchors those behaviors\n\nSo I would not interpret replay’s strength as only “memory storage.” It may also be task-interface and behavior-distribution anchoring.\n\n* * *\n\n## 3. Why EWC/Fisher-style methods can underperform\n\nI would be cautious about expecting EWC/Fisher-style regularization to carry the whole burden in LLM continual fine-tuning.\n\nThat is not because the idea is unreasonable. It is reasonable: identify parameters important for old tasks and penalize moving them too much.\n\nBut there are several practical issues.\n\nIssue | Why it matters\n---|---\nCost | Importance estimation can be expensive at LLM scale.\nNoise | Fisher/importance estimates depend on the data used to estimate them.\nTuning sensitivity | Too weak: forgetting remains. Too strong: the model cannot learn the new task.\nProxy gap | Parameter importance is not the same thing as behavior preservation.\n\nA useful way to phrase it:\n\n> EWC protects an estimated parameter-space proxy. Replay protects behavior on selected examples.\n\nThat does not mean EWC-style methods are useless. I would just be cautious about asking them to replace behavioral anchors entirely.\n\nFor example, How to Alleviate Catastrophic Forgetting in LLMs Fine-tuning? 
discusses Fisher/EWCLoRA-style costs and reports substantial overhead for Fisher estimation in a GPT-J-6B setting. I would treat that kind of result less as “EWC is bad” and more as “this may be operationally awkward and proxy-heavy at LLM scale.”\n\n* * *\n\n## 4. Why I still like update constraints\n\nI actually agree with the update-constraint intuition.\n\nO-LoRA is a good example of a sensible direction. O-LoRA learns tasks in different low-rank subspaces and keeps those subspaces orthogonal to reduce interference. That is attractive for several reasons:\n\n  * it is parameter-efficient,\n  * it reduces direct interference,\n  * it does not require storing old user data for replay,\n  * it fits naturally with PEFT-style workflows.\n\n\n\nSo I would not say:\n\n> replay good, O-LoRA bad\n\nI would say:\n\n> replay and O-LoRA-like constraints solve different parts of the problem.\n\nComponent | What it gives you\n---|---\nReplay / rehearsal / regression examples | A behavioral specification of what should remain stable\nO-LoRA / freezing / low-rank constraints | A way to reduce destructive plasticity\nEvaluation suites | A way to detect regressions that replay did not cover\nRAG/tools/DB | A way to avoid putting volatile facts into weights\nRetraining from a stable checkpoint | A reset path when incremental updates become too messy\n\nI would see update constraints as complementary to replay/eval anchors, not as their direct replacement.\n\nA concise version:\n\n> O-LoRA tells the model where it is allowed to move.\n>  Replay/evals tell us what behavior must not change.\n\nBoth are useful.\n\n* * *\n\n## 5. 
LoRA is also a stability-plasticity knob\n\nOne paper I would keep in mind here is LoRA Learns Less and Forgets Less.\n\nThe high-level takeaway is not simply “LoRA is better” or “LoRA is worse.” It is more like:\n\n> Low-rank adaptation can preserve the base distribution better partly because it updates the model less aggressively.\n\nThat is useful for continual learning, but it also means LoRA can under-adapt on sufficiently large or difficult domain shifts.\n\nSo I would treat these as knobs, not universal answers:\n\n  * LoRA rank,\n  * target modules,\n  * learning rate,\n  * amount of freezing,\n  * replay ratio,\n  * replay selection,\n  * whether to merge adapters,\n  * whether to restart from a stable checkpoint.\n\n\n\nThis matters because “forgetting less” and “learning the new domain less” can sometimes look similar if we only look at old-task retention.\n\nA practical framing:\n\nIf you want more stability | If you want more plasticity\n---|---\nlower rank LoRA | higher rank LoRA / rank-stabilized LoRA\nmore frozen layers | more trainable modules\nmore replay/general anchors | less replay pressure\nsmaller learning rate | larger learning rate\nstricter eval gate | faster adaptation\nadapter isolation | more sharing / merging / full fine-tuning\n\nI would not pick one side abstractly. It depends on whether the update is a small style/domain adaptation or a large capability shift.\n\n* * *\n\n## 6. 
What happens with 10–20 domains?\n\nThis is where I think the simple “naive replay wins” story becomes less clean.\n\nFor short sequences, random replay can look very strong because the buffer still covers the old tasks reasonably well.\n\nFor longer sequences, I would expect the bottleneck to move to:\n\n  * replay selection,\n  * replay scheduling,\n  * replay staleness,\n  * privacy/licensing constraints,\n  * consolidation,\n  * general capability retention,\n  * task interference,\n  * evaluation coverage.\n\n\n\nSo I would not say:\n\n> naive random replay scales indefinitely\n\nI would say:\n\n> the replay principle remains strong, but naive random replay becomes a design problem.\n\nRecent work seems to move in that direction.\n\nExamples:\n\n  * InsCL uses instruction information to guide replay selection.\n  * Self-Synthesized Rehearsal uses synthetic rehearsal when old real data is not available.\n  * SuRe frames replay failures around selection and integration, using surprise-prioritized replay and slow-weight consolidation.\n  * On-Policy Replay for Continual Supervised Fine-Tuning is a recent direction suggesting that replay may be useful not only because of fixed old labels, but because it anchors old prompt distributions under the current model policy.\n\n\n\nI would treat the newer papers as suggestive rather than settled. But they point in a similar direction:\n\n> The practical frontier is not replay vs. no replay.\n>  It is what replay signal, what replay source, what replay schedule, and what update constraint.\n\n* * *\n\n## 7. 
Different replay variants are not equivalent\n\nIt may help to separate “replay” into several different things.\n\nReplay variant | What it anchors | When it is useful\n---|---|---\nRaw old-task replay | old examples and labels | when old data is legal, available, and still valid\nCurated regression replay | specific behaviors/capabilities | production regression protection\nGeneral/capability replay | base behavior, reasoning, instruction-following | preventing global model drift\nSynthetic replay | generated old-task-like behavior | when old data cannot be stored\nOn-policy replay | old prompts with current-policy responses | reducing drift while avoiding stale off-policy targets\nApproximate regularized replay | initial-model behavior / pretraining-like distribution | lower-overhead anchoring without raw old data\nEval-only canaries | detection rather than training | safety/tool/schema regression monitoring\n\nThis is why I would avoid saying “replay” as if it were one fixed method.\n\nThe production question is usually more specific:\n\n> What can I legally keep?\n>  What behavior must not regress?\n>  How stale are the old examples?\n>  How much new-domain plasticity do I need?\n>  How expensive is full retraining?\n>  What failures are covered by evals but not replay?\n\n* * *\n\n## 8. 
A practical recipe I would try\n\nAs a default engineering recipe, I would probably start with something like this.\n\nSituation | I would consider\n---|---\nVolatile facts, documents, prices, APIs, policies | RAG/tools/DB rather than weight updates\nSmall domain/style adaptation | LoRA/QLoRA + small curated replay/eval anchor\nNeed to preserve general instruction-following | general/capability replay\nOld user data cannot be retained | synthetic replay or eval-only canaries\n10–20 sequential domains | selected replay + scheduling + consolidation\nSafety / schema / tool-call behavior | replay + evals + validators/constrained decoding\nMajor policy or general behavior update | restart from a stable checkpoint or do a more cumulative re-finetune\nUnsure what is regressing | run old/new/general evals before choosing a CL method\n\nIn other words, I would not start from:\n\n> Which continual-learning algorithm is best?\n\nI would start from:\n\n> What kind of update is this?\n\nThen choose the least dangerous mechanism.\n\n### If it is factual and volatile\n\nI would avoid fine-tuning if possible.\n\nUse:\n\n  * retrieval,\n  * tools,\n  * database,\n  * document index,\n  * model editing only if truly appropriate.\n\n\n\n### If it is a stable skill/style/format update\n\nI would use PEFT, but keep behavioral anchors.\n\nUse:\n\n  * LoRA/QLoRA,\n  * small curated replay,\n  * general replay,\n  * regression evals,\n  * safety/schema/tool-call canaries.\n\n\n\n### If it is a large behavior/policy update\n\nI would be more cautious with endless sequential adapters.\n\nUse:\n\n  * stable checkpoint restart,\n  * broader re-finetuning,\n  * cumulative or selectively replayed training mix,\n  * full regression suite,\n  * rollback plan.\n\n\n\n* * *\n\n## 9. My answer to the original questions\n\n### Q1. Is replay only winning because benchmarks are short?\n\nPartly possible. 
I would be cautious about over-interpreting short task sequences.\n\nBut I do not think the right conclusion is:\n\n> replay is a benchmark trick\n\nI would say:\n\n> replay is a strong principle, while naive random replay becomes less satisfying as the stream gets longer.\n\nFor 10–20 domains, I would expect selection, scheduling, staleness, and consolidation to matter much more.\n\n### Q2. Why do EWC/Fisher-style methods underperform?\n\nMy guess is: several reasons at once.\n\n  * Fisher/importance estimation is expensive.\n  * The estimate is data-dependent and noisy.\n  * The penalty is hard to tune.\n  * Most importantly, parameter importance is an indirect proxy for behavior preservation.\n\n\n\nThat last point is the one I would emphasize most, though still with some caution.\n\n### Q3. Should we prefer update constraints?\n\nI think the intuition is good, but I would not make it exclusive.\n\nUpdate constraints reduce destructive plasticity. Replay/evals define what should remain stable.\n\nSo my practical answer would be:\n\n> use update constraints to make replay/eval easier, not necessarily to eliminate them.\n\n### Q4. What would I do in production?\n\nMy default would be:\n\n> constrained PEFT + small curated/general/synthetic replay + regression evals\n\nwith RAG/tools/DB for volatile factual updates.\n\nI would tune the replay ratio, LoRA rank, learning rate, and amount of freezing as stability-plasticity knobs rather than assuming a universal setting.\n\n* * *\n\n## 10. Bottom line\n\nI would not read the current pattern as:\n\n> sophisticated CL methods are useless\n\nor:\n\n> replay is the final answer\n\nor:\n\n> O-LoRA/update constraints are the wrong direction\n\nI would read it more modestly:\n\n> Behavioral anchoring is very hard to replace with purely parameter-space or update-space constraints.\n\nReplay is one simple way to provide that anchor. O-LoRA, freezing, LoRA rank control, and related methods are useful ways to reduce destructive plasticity. 
In practice, I would expect the robust recipe to combine both.\n\nSo my current framing would be:\n\n> Not replay vs. update constraints.\n>  Behavioral anchor + constrained update + regression evals.\n\nThat seems like the most practical way to interpret why naive replay remains such a hard baseline to beat.",
  "title": "Why does naive replay still beat most \"sophisticated\" continual-learning methods in practice?"
}