Establishes scaling laws to determine the optimal compute split between general pretraining and domain-specific specialization.
March 20, 2026
Original Paper
Optimal Splitting of Language Models from Mixtures to Specialized Domains
arXiv · 2603.19149
The Takeaway
Provides a predictive formula for allocating tokens when specializing models, moving beyond the heuristic two-stage recipe. For practitioners building vertical-domain LLMs, the method extrapolates performance accurately to larger models and budgets, ensuring compute is not wasted.
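The paper's actual formula is not reproduced in this summary, but the general shape of such an allocation rule can be sketched. The snippet below is a hypothetical illustration: it assumes the specialized-domain loss decomposes into two power-law terms, one in the tokens spent on general pretraining and one in the tokens spent on specialization, plus an irreducible floor, then grid-searches the split fraction that minimizes the predicted loss. All constants and functional forms here are made-up assumptions, not the paper's fitted values.

```python
# Hypothetical scaling-law sketch (NOT the paper's formula): loss on the
# target domain as a function of the split fraction f, where f of the
# total token budget D goes to specialization and (1 - f) to pretraining.

def specialized_loss(f, D, A=4.0, alpha=0.3, B=2.0, beta=0.5, E=1.7):
    """Predicted domain loss under an assumed two-term power law.

    A, alpha  -- illustrative constants for the general-pretraining term
    B, beta   -- illustrative constants for the specialization term
    E         -- illustrative irreducible loss floor
    """
    general = A / ((1.0 - f) * D) ** alpha  # benefit of broad pretraining
    domain = B / (f * D) ** beta            # benefit of domain tokens
    return general + domain + E

def best_split(D, grid=999):
    """Grid-search the interior split fraction minimizing predicted loss."""
    candidates = [(i + 1) / (grid + 1) for i in range(grid)]
    return min(candidates, key=lambda f: specialized_loss(f, D))

if __name__ == "__main__":
    for budget in (1e9, 1e10, 1e11):
        f = best_split(budget)
        print(f"budget={budget:.0e} tokens -> specialize fraction ~ {f:.2f}")
```

Under these assumed exponents the optimal fraction shifts with the budget, which is the kind of behavior a fitted scaling law would let you extrapolate to larger runs; in practice the constants would be fit from small pilot runs rather than chosen by hand.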
From the abstract
Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data, followed by specialization on a subset of high-quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training.