AI & ML Scaling Insight

Synthetic data scaling advances from simple rephrasing to constructing 'megadocs' through rationale insertion and document stitching.

March 20, 2026

Original Paper

Data-efficient pre-training by scaling synthetic megadocs

Konwoo Kim, Suhas Kotha, Yejin Choi, Tatsunori Hashimoto, Nick Haber, Percy Liang

arXiv · 2603.18534

The Takeaway

The paper demonstrates that the data-efficiency multiplier of synthetic data can be raised from 1.48x to 1.80x by restructuring that data into long-context 'megadocs'. This offers a blueprint for overcoming the looming 'data wall' in LLM pre-training.
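To make the headline idea concrete, here is a minimal, hypothetical sketch of the two operations named above: inserting model-generated rationales into a document and stitching related documents into one long-context training example. This is not the authors' implementation; `generate_rationale` stands in for an arbitrary LLM call, and the chunking scheme and separator are assumptions for illustration.

```python
# Hypothetical sketch of "rationale insertion" and "stitching" for megadocs.
# Not the paper's code: generate_rationale() stands in for any LLM call.
from typing import Callable, List

def insert_rationales(doc: str, generate_rationale: Callable[[str], str],
                      chunk_size: int = 512) -> str:
    """Interleave each chunk of a document with a model-written rationale."""
    chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
    augmented = []
    for chunk in chunks:
        augmented.append(chunk)
        augmented.append(generate_rationale(chunk))  # rationale for this chunk
    return "\n".join(augmented)

def stitch_megadoc(docs: List[str], separator: str = "\n\n---\n\n") -> str:
    """Concatenate related documents into one long-context 'megadoc'."""
    return separator.join(docs)

# Usage: build one long training example from several augmented documents.
# megadoc = stitch_megadoc([insert_rationales(d, llm_rationale) for d in docs])
```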

From the abstract

Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution.
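As a rough illustration of the mixing setup the abstract describes, the sketch below combines real web documents with synthetically rephrased variants into a single pre-training stream. The `rephrase` callable and the 0.5 mixing ratio are assumptions for illustration, not values taken from the paper.

```python
# Hypothetical sketch of mixing web data with synthetic rephrases for
# pre-training; rephrase() and the 0.5 ratio are illustrative assumptions.
import random
from typing import Callable, Iterable, Iterator

def mixed_pretraining_stream(web_docs: Iterable[str],
                             rephrase: Callable[[str], str],
                             synthetic_ratio: float = 0.5,
                             seed: int = 0) -> Iterator[str]:
    """Yield real web documents, interleaved with rephrased variants."""
    rng = random.Random(seed)
    for doc in web_docs:
        yield doc  # always keep the original web document
        if rng.random() < synthetic_ratio:
            yield rephrase(doc)  # add a synthetic variant some of the time
```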