AI & ML Efficiency Breakthrough

Matches the performance of the complex SFT+GRPO reasoning pipeline for Vision-Language Models in 1/7th of the training time.

March 20, 2026

Original Paper

Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz

arXiv · 2603.18656

The Takeaway

The paper introduces SCALe (Scheduled Curriculum Adaptive Loss), which handles the token imbalance in long reasoning traces. This lets practitioners train reasoning-capable VLMs with significantly less compute while avoiding the stability issues of reinforcement learning.
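To make the idea concrete, here is a minimal sketch of what a scheduled, curriculum-style token reweighting could look like. This is an illustration, not the paper's actual formulation: the function names, the linear schedule, and the specific weight values are all assumptions. The core notion is that reasoning-trace tokens and answer tokens get separate loss weights, with the reasoning weight ramped up over training so long traces cannot drown out short, task-critical answer spans.

```python
def scale_weights(token_tags, step, total_steps, w_answer=1.0):
    """Hypothetical SCALe-style per-token loss weights (illustrative only).

    token_tags: list of "reasoning" or "answer" labels, one per token.
    Reasoning tokens start down-weighted and ramp up linearly over
    training, so the answer span dominates the loss early on.
    """
    # Linear curriculum: reasoning weight grows from 0.1 to 1.0 (assumed values).
    w_reasoning = 0.1 + 0.9 * min(step / total_steps, 1.0)
    return [w_reasoning if t == "reasoning" else w_answer for t in token_tags]


def weighted_nll(token_log_probs, weights):
    """Weighted negative log-likelihood, normalized by the total weight
    so the loss scale stays comparable across schedule stages."""
    total_w = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, token_log_probs)) / total_w
```

In this sketch, standard SFT is the special case where every weight is 1.0; the curriculum simply interpolates back toward that uniform loss as training progresses.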

From the abstract

Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long traces overshadow short but task-critical segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and […]