Establishes scaling laws to determine the optimal compute split between general pretraining and domain-specific specialization.
March 20, 2026
Original Paper
Optimal Splitting of Language Models from Mixtures to Specialized Domains
arXiv · 2603.19149
The Takeaway
Provides a predictive formula for allocating tokens when specializing models, moving beyond the heuristic two-stage recipe. For practitioners building vertical-domain LLMs, the method extrapolates performance accurately to larger models and budgets, ensuring compute is not wasted.
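The paper's actual formula is not reproduced in this summary, but the general shape of such an allocation rule can be sketched. The snippet below is a hypothetical illustration: it assumes the specialized-domain loss decomposes into two power-law terms, one in the tokens spent on general pretraining and one in the tokens spent on specialization, plus an irreducible floor, then grid-searches the split fraction that minimizes the predicted loss. All constants and functional forms here are made-up assumptions, not the paper's fitted values.

```python
# Hypothetical scaling-law sketch (NOT the paper's formula): loss on the
# target domain as a function of the split fraction f, where f of the
# total token budget D goes to specialization and (1 - f) to pretraining.

def specialized_loss(f, D, A=4.0, alpha=0.3, B=2.0, beta=0.5, E=1.7):
    """Predicted domain loss under an assumed two-term power law.

    A, alpha  -- illustrative constants for the general-pretraining term
    B, beta   -- illustrative constants for the specialization term
    E         -- illustrative irreducible loss floor
    """
    general = A / ((1.0 - f) * D) ** alpha  # benefit of broad pretraining
    domain = B / (f * D) ** beta            # benefit of domain tokens
    return general + domain + E

def best_split(D, grid=999):
    """Grid-search the interior split fraction minimizing predicted loss."""
    candidates = [(i + 1) / (grid + 1) for i in range(grid)]
    return min(candidates, key=lambda f: specialized_loss(f, D))

if __name__ == "__main__":
    for budget in (1e9, 1e10, 1e11):
        f = best_split(budget)
        print(f"budget={budget:.0e} tokens -> specialize fraction ~ {f:.2f}")
```

Under these assumed exponents the optimal fraction shifts with the budget, which is the kind of behavior a fitted scaling law would let you extrapolate to larger runs; in practice the constants would be fit from small pilot runs rather than chosen by hand.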
From the abstract
Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data, followed by specialization on a subset of high-quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training.