{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigorkebvzoeorgdbpcz33exf23oziygvygqhmfv3xpfgmxzohawta",
"uri": "at://did:plc:4rgrdigiftglskeax4wvmsev/app.bsky.feed.post/3mlnlyaq7x4s2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreiflo6xt7is6b2iafwghkjahlgggocme5jwjsbeuqqwcywuvjhmszm"
},
"mimeType": "image/png",
"size": 24783
},
"path": "/abs/2605.10019v1",
"publishedAt": "2026-05-12T00:00:00.000Z",
"site": "https://arxiv.org",
"tags": [
"Binxu Wang",
"Emma Lucia Byrnes Finn",
"Bingbin Liu"
],
"textContent": "**Authors:** Binxu Wang, Emma Lucia Byrnes Finn, Bingbin Liu\n\nGenerative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: $τ_{\\mathrm{rule}}$, the step at which generations first become rule-valid, and $τ_{\\mathrm{mem}}$, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks, $τ_{\\mathrm{rule}}$ and $τ_{\\mathrm{mem}}$, depend on key aspects of the learning setup. Specifically, we show that $τ_{\\mathrm{rule}}$ increases with rule complexity and decreases with model capacity, while $τ_{\\mathrm{mem}}$ is approximately invariant to the rule and scales nearly linearly with dataset size $N$. We define the \\emph{innovation window} as the interval $[τ_{\\mathrm{rule}}, τ_{\\mathrm{mem}}]$. This window widens with increasing $N$ and narrows with rule complexity, and may vanish entirely when $τ_{\\mathrm{rule}} \\geq τ_{\\mathrm{mem}}$. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around $τ_{\\mathrm{rule}}$, while training samples' basins begin to dominate around $τ_{\\mathrm{mem}}$. Together, these results yield a unified and predictive account of when and how generative models exhibit genuine innovation.",
"title": "The two clocks and the innovation window: When and how generative models learn rules"
}