AI & ML · Breaks Assumption

Challenges the standard 'pretrain-then-finetune' pipeline by showing that repeating domain-specific data during pretraining is significantly more compute-efficient.

arXiv · March 18, 2026 · 2603.16177

Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek, Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J. Zico Kolter, Bogdan Gaza, Ari Morcos, Matthew Leavitt, Pratyush Maini

The Takeaway

The paper demonstrates that specialized pretraining (SPT) reduces the compute needed for domain adaptation by up to 1.75x compared to standard finetuning. It also provides empirical scaling laws that help practitioners determine how many times to repeat a small, high-quality domain dataset during pretraining to maximize performance.
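To make the repetition idea concrete, here is a minimal sketch of the arithmetic involved: given a total pretraining token budget and a small domain dataset, the number of repetitions follows from the fraction of total tokens the domain data should occupy. The function name and the example numbers are illustrative assumptions; the paper's empirical scaling laws (not reproduced here) are what determine the optimal fraction.

```python
def spt_repeats(total_tokens: float, domain_tokens: float, domain_fraction: float) -> float:
    """Number of times to repeat the domain dataset so that it makes up
    `domain_fraction` of the total pretraining token budget.

    Illustrative arithmetic only: the choice of `domain_fraction` is the
    hard part, which the paper addresses with empirical scaling laws.
    """
    domain_budget = total_tokens * domain_fraction
    return domain_budget / domain_tokens

# Hypothetical example: a 1B-token domain set mixed into a 500B-token
# pretraining run at a 1% share works out to about 5 repetitions.
repeats = spt_repeats(total_tokens=500e9, domain_tokens=1e9, domain_fraction=0.01)
```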

From the abstract

Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPi…