AI & ML: Breaking a Pre-Training Assumption

Discovers that skipping learning rate decay during pre-training, while appearing worse for pre-train loss, significantly improves the model's adaptability during supervised fine-tuning (SFT).

arXiv · March 18, 2026 · 2603.16127

Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, Jun Suzuki

The Takeaway

Challenges the industry-standard 'cosine decay' recipe; the findings suggest that models trained without decay settle into flatter minima, making them markedly more robust and performant for downstream instruction tuning.

From the abstract

We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup […]
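To make the contrast concrete, here is a minimal sketch of the two schedules being compared: the conventional warmup-plus-cosine-decay baseline and the Warmup-Stable-Only (WSO) schedule, which simply holds the peak learning rate constant after warmup. The linear warmup shape, function names, and hyperparameters are illustrative assumptions, not details from the paper.

```python
import math

def wso_lr(step: int, warmup_steps: int, peak_lr: float) -> float:
    """Warmup-Stable-Only (WSO): linear warmup, then a constant LR (no decay).

    Sketch only; the paper's exact warmup shape is an assumption here.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # stable phase: no decay

def cosine_decay_lr(step: int, warmup_steps: int, total_steps: int,
                    peak_lr: float, min_lr: float = 0.0) -> float:
    """Conventional warmup + cosine decay baseline, shown for contrast."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Late in training the two recipes diverge: cosine decay drives the learning rate toward `min_lr`, while WSO keeps stepping at `peak_lr`, which is the difference the paper links to pre-train loss versus post-SFT adaptability.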