Establishes a three-dimensional scaling law for RAG-considerate pretraining, modeling the optimal allocation of a fixed budget across model parameters, pretraining tokens, and retrieval store size.
April 2, 2026
Original Paper
To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining
arXiv · 2604.00715
The Takeaway
The paper provides the first quantitative framework for deciding when to stop pretraining and start investing in a larger retrieval corpus. This is critical for practitioners building knowledge-intensive systems who need to balance compute costs against inference-time accuracy.
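To make the trade-off concrete, here is a minimal sketch of how such a three-term scaling law could be used to split a fixed token budget between pretraining data and a retrieval store. The functional form (a Chinchilla-style sum of power laws) and every constant below are illustrative assumptions, not fitted values from the paper.

```python
# Hypothetical three-term scaling law in the spirit of Chinchilla-style fits:
#   loss(N, D, R) = E + A/N^alpha + B/D^beta + C/R^gamma
# where N = model parameters, D = pretraining tokens, R = retrieval-store tokens.
# All constants are placeholders chosen for illustration only.
E, A, B, C = 1.7, 400.0, 500.0, 300.0
ALPHA, BETA, GAMMA = 0.34, 0.28, 0.20


def loss(n_params: float, n_tokens: float, n_store: float) -> float:
    """Predicted loss under the assumed three-term power law."""
    return (E
            + A / n_params ** ALPHA
            + B / n_tokens ** BETA
            + C / n_store ** GAMMA)


def best_split(data_budget: float, n_params: float, steps: int = 99):
    """Grid-search the pretraining/retrieval split of a fixed token budget.

    Returns (predicted_loss, fraction_of_budget_spent_on_pretraining).
    """
    best = None
    for i in range(1, steps):
        frac = i / steps
        d, r = frac * data_budget, (1.0 - frac) * data_budget
        current = loss(n_params, d, r)
        if best is None or current < best[0]:
            best = (current, frac)
    return best
```

For example, `best_split(1e12, 7e9)` returns the loss-minimizing fraction of a 1T-token budget to spend on pretraining for a 7B-parameter model under these assumed constants; in practice the exponents and coefficients would be fitted empirically, as the paper's framework prescribes.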
From the abstract
Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive tasks. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data […]