Synthetic Mixed Training lets an 8B model outperform RAG on long-document comprehension by combining synthetic QA pairs with rewritten documents.
March 26, 2026
Original Paper
Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
arXiv · 2603.23562
The Takeaway
The paper provides a blueprint for breaking the 'RAG ceiling', where parametric knowledge usually trails retrieval. By demonstrating log-linear scaling with both synthetic data volume and generator strength, it shows how models can internalize large corpora more effectively than retrieving from them at inference time.
From the abstract
Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods by training on more synthetic tokens or using stronger generators yields diminishing returns below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This leverages their complementary training signals, and enables log-linear improvements as both synth…
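To make the mixing concrete, here is a minimal sketch of how a mixed synthetic corpus might be assembled from source documents. All function names are hypothetical, and the toy string transformations stand in for the LLM generators the paper actually uses (a QA generator and a document rewriter); only the mixing structure is the point.

```python
import random

def make_synthetic_qas(doc, n=3):
    # Hypothetical stand-in for an LLM QA generator:
    # turn each sentence of the source into one QA training sample.
    sentences = [s.strip() for s in doc.split(".") if s.strip()]
    return [
        f"Q: What does the source state (fact {i + 1})?\nA: {sent}."
        for i, sent in enumerate(sentences[:n])
    ]

def make_synthetic_document(doc):
    # Hypothetical stand-in for an LLM rewriter:
    # emit a restructured paraphrase of the whole document.
    sentences = [s.strip() for s in doc.split(".") if s.strip()]
    return ". ".join(reversed(sentences)) + "."

def build_mixed_corpus(docs, seed=0):
    """Interleave synthetic QAs and rewritten documents into one
    shuffled training set, so both signal types appear in training."""
    rng = random.Random(seed)
    samples = []
    for doc in docs:
        samples.extend(make_synthetic_qas(doc))   # QA-style samples
        samples.append(make_synthetic_document(doc))  # document-style sample
    rng.shuffle(samples)
    return samples

corpus = build_mixed_corpus(
    ["The model is 8B. It is trained on synthetic data. It beats RAG."]
)
print(len(corpus))  # 3 QA samples + 1 rewritten document = 4
```

In the paper's setup the two sample types are claimed to carry complementary training signals, which is why both are mixed into a single continued-pretraining corpus rather than used separately.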