Provides a learning-theoretic characterization of model collapse, proving exactly when replaying past outputs destroys model diversity.
arXiv · March 13, 2026 · 2603.11784
Why it matters
It offers a formal mathematical foundation for the model collapse phenomenon, showing that uniform generation remains safe under replay, while non-uniform generation is fundamentally limited by it. This allows practitioners to build training pipelines with theoretical guarantees on data health as the web becomes saturated with AI content.
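The replay dynamic at the heart of the result can be illustrated with a toy simulation (a cartoon, not the paper's construction): a model that refits itself only on a finite sample of its own previous output can never gain new symbols, so its support, and hence its diversity, is monotonically non-increasing. All names and parameters below are hypothetical.

```python
import random

def resample_generations(k=10, n=20, generations=50, seed=0):
    """Toy replay loop: each generation 'trains' (here, empirical
    frequency estimation) only on n samples drawn from the previous
    generation's distribution over k symbols. Because a symbol that
    is never sampled gets probability zero, support can only shrink."""
    rng = random.Random(seed)
    weights = [1.0 / k] * k          # generation 0: uniform over k symbols
    support_sizes = [k]
    for _ in range(generations):
        draws = rng.choices(range(k), weights=weights, k=n)
        counts = [draws.count(s) for s in range(k)]
        weights = [c / n for c in counts]   # refit purely on own output
        support_sizes.append(sum(c > 0 for c in counts))
    return support_sizes

sizes = resample_generations()
```

With a small sample size `n`, the support typically collapses toward a handful of symbols within a few dozen generations, matching the intuition that non-uniform generation degrades under replay.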
From the abstract
As scaling laws push the training of frontier large language models (LLMs) toward ever-growing data requirements, training pipelines are approaching a regime where much of the publicly available online text may be consumed. At the same time, widespread LLM usage increases the volume of machine-generated content on the web; together, these trends raise the likelihood of generated text re-entering future training corpora, increasing the associated risk of performance degradation often called model collapse.