AI & ML Efficiency Breakthrough

Decouples data mixture ratio selection from continual pre-training by optimally merging distribution vectors post hoc, at 15-35x lower compute cost.

April 1, 2026

Original Paper

OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

Haiyue Song, Masao Utiyama

arXiv · 2603.28858

The Takeaway

Traditionally, finding the right data mix for domain adaptation requires expensive trial-and-error training runs. OptiMer allows practitioners to train on datasets individually and then 'search' for the optimal blend without re-training, significantly accelerating LLM adaptation.

From the abstract

Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: the ratios must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset and extract each model's distribution vector, which represents the parameter shift induced by that dataset…
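The idea described in the abstract — a distribution vector as the parameter shift induced by one dataset, with the mixture ratio searched after training — can be illustrated with a toy sketch. This is not the paper's implementation; the parameter arrays, the synthetic validation loss, and the simple grid search are all assumptions made for illustration.

```python
import numpy as np

# Toy sketch (assumed setup, not the paper's code):
# theta_base: base model parameters; theta_a, theta_b: models continually
# pre-trained on datasets A and B individually.
rng = np.random.default_rng(0)
theta_base = rng.normal(size=16)
theta_a = theta_base + rng.normal(0.5, 0.1, size=16)   # shift induced by dataset A
theta_b = theta_base + rng.normal(-0.3, 0.1, size=16)  # shift induced by dataset B

# Distribution vectors: the parameter shift each dataset induces.
delta_a = theta_a - theta_base
delta_b = theta_b - theta_base

def val_loss(theta):
    # Stand-in for a real validation loss on the target domain;
    # here the optimum is a 0.7 / 0.3 blend by construction.
    target = theta_base + 0.7 * delta_a + 0.3 * delta_b
    return float(np.mean((theta - target) ** 2))

# Search the mixture ratio post hoc -- no re-training, just cheap
# evaluations of merged parameter vectors.
loss, lam = min(
    (val_loss(theta_base + lam * delta_a + (1 - lam) * delta_b), lam)
    for lam in np.linspace(0, 1, 101)
)
print(f"best ratio for A: {lam:.2f}, loss: {loss:.4f}")
# -> best ratio for A: 0.70, loss: 0.0000
```

The point of the sketch is the cost structure: each candidate mixture is a vector sum plus one validation pass, rather than a full continual pre-training run, which is where the reported 15-35x compute saving comes from.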