{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigorkebvzoeorgdbpcz33exf23oziygvygqhmfv3xpfgmxzohawta",
    "uri": "at://did:plc:4rgrdigiftglskeax4wvmsev/app.bsky.feed.post/3mlnlyaq7x4s2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreiflo6xt7is6b2iafwghkjahlgggocme5jwjsbeuqqwcywuvjhmszm"
    },
    "mimeType": "image/png",
    "size": 24783
  },
  "path": "/abs/2605.10019v1",
  "publishedAt": "2026-05-12T00:00:00.000Z",
  "site": "https://arxiv.org",
  "tags": [
    "Binxu Wang",
    "Emma Lucia Byrnes Finn",
    "Bingbin Liu"
  ],
  "textContent": "**Authors:** Binxu Wang, Emma Lucia Byrnes Finn, Bingbin Liu\n\nGenerative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: $τ_{\\mathrm{rule}}$, the step at which generations first become rule-valid, and $τ_{\\mathrm{mem}}$, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks, $τ_{\\mathrm{rule}}$ and $τ_{\\mathrm{mem}}$, depend on key aspects of the learning setup. Specifically, we show that $τ_{\\mathrm{rule}}$ increases with rule complexity and decreases with model capacity, while $τ_{\\mathrm{mem}}$ is approximately invariant to the rule and scales nearly linearly with dataset size $N$. We define the \\emph{innovation window} as the interval $[τ_{\\mathrm{rule}}, τ_{\\mathrm{mem}}]$. This window widens with increasing $N$ and narrows with rule complexity, and may vanish entirely when $τ_{\\mathrm{rule}} \\geq τ_{\\mathrm{mem}}$. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around $τ_{\\mathrm{rule}}$, while training samples' basins begin to dominate around $τ_{\\mathrm{mem}}$. Together, these results yield a unified and predictive account of when and how generative models exhibit genuine innovation.",
  "title": "The two clocks and the innovation window: When and how generative models learn rules"
}