Achieves state-of-the-art video understanding without expensive human-annotated Chain-of-Thought (CoT) data.
March 30, 2026
Original Paper
Reinforcing Structured Chain-of-Thought for Video Understanding
arXiv · 2603.25942
The Takeaway
It replaces costly supervised fine-tuning (SFT) with a Summary-Driven Reinforcement Learning framework. By using self-supervised consistency rewards, it forces MLLMs to ground their reasoning in factual visual summaries, effectively eliminating the 'thinking drift' common in long-form video analysis.
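The paper's exact reward formulation isn't reproduced here; as a rough sketch, a self-supervised consistency reward can be read as scoring how well the claims in a reasoning trace agree with the model's own visual summary. The token-overlap proxy and function name below are illustrative assumptions, not the paper's implementation.

```python
def consistency_reward(summary: str, reasoning: str) -> float:
    """Hypothetical stand-in for a self-supervised consistency reward:
    reward reasoning whose content words also appear in the model's own
    visual summary (simple token overlap). The paper's actual reward
    may be computed very differently."""
    stop = {"the", "a", "an", "of", "to", "in", "is", "and", "it", "he", "she"}
    summary_tokens = {t for t in summary.lower().split() if t not in stop}
    reasoning_tokens = [t for t in reasoning.lower().split() if t not in stop]
    if not reasoning_tokens:
        return 0.0
    grounded = sum(1 for t in reasoning_tokens if t in summary_tokens)
    return grounded / len(reasoning_tokens)

# Reasoning that stays grounded in the summary scores higher than
# reasoning that has drifted away from the visual evidence.
summary = "a man opens a door then picks up a red box"
on_topic = "he picks up the red box after opening the door"
drifted = "he drives a car to the airport and boards a plane"
assert consistency_reward(summary, on_topic) > consistency_reward(summary, drifted)
```

Under this kind of reward, 'thinking drift' is penalized automatically: reasoning steps unsupported by the summary pull the score down without any human CoT labels.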
From the abstract
Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods usually depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs' ability…
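For context on the GRPO baseline the abstract mentions: GRPO estimates each sampled response's advantage by normalizing its reward against the mean and standard deviation of the group of responses sampled for the same prompt, avoiding a learned value critic. A minimal sketch of that group-relative normalization (function name and reward values here are illustrative):

```python
from statistics import mean, stdev

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage as used in GRPO-style RL: z-score each
    response's reward against the group sampled for the same prompt,
    so no separate value critic is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Responses scoring above the group mean get positive advantage,
# those below get negative advantage.
advs = grpo_advantages([0.2, 0.8, 0.5, 0.9])
assert advs[1] > 0 and advs[3] > 0
assert advs[0] < 0
```

The summary-driven framework plugs its self-supervised rewards into this kind of group-relative update, which is how it sidesteps the SFT and CoT-annotation stages.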