AI & ML New Capability

Shifts multimodal LLMs from static image prefixes to an active, sequential 'Visual Chain-of-Thought' that explores images based on saliency.

March 31, 2026

Original Paper

Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning

Guangfu Guo, Xiaoqian Lu, Yue Feng, Mingming Sun

arXiv · 2603.26737

The Takeaway

Current VLMs process images as fixed grids of static visual tokens; SSV-CoT instead implements a goal-driven curriculum in which the model attends first to primary visual cues and then to secondary ones. This end-to-end method significantly improves complex visual reasoning without requiring expensive region-level annotations.

From the abstract

Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention shifts selectively and sequentially from the most informative regions to secondary cues, we propose Structural Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, …
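The core idea, saliency-ordered sequential attention over image regions, can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the region format, the keyword-overlap scoring (standing in for a learned question-relevant saliency map), and all function names are assumptions for the sake of the example.

```python
def saliency_scores(regions, question_keywords):
    """Score each region by overlap with question keywords (a toy proxy
    for a learned, question-relevant saliency map)."""
    return [sum(kw in r["tags"] for kw in question_keywords) for r in regions]

def sequential_visual_cot(regions, question_keywords):
    """Order regions from most to least salient, then 'attend' to them
    one at a time, accumulating a reasoning trace."""
    scores = saliency_scores(regions, question_keywords)
    ordered = sorted(zip(scores, regions), key=lambda sr: -sr[0])
    trace = []
    for rank, (score, region) in enumerate(ordered, start=1):
        trace.append(f"step {rank}: attend to {region['name']} (saliency={score})")
    return trace

regions = [
    {"name": "background sky", "tags": ["sky", "blue"]},
    {"name": "stop sign", "tags": ["sign", "red", "text"]},
    {"name": "car", "tags": ["vehicle", "red"]},
]
trace = sequential_visual_cot(regions, ["sign", "text", "red"])
print(trace[0])  # the most question-relevant region is attended first
```

The contrast with a static visual prefix is the ordering step: instead of consuming all regions in fixed grid order, the model's visual access follows the saliency ranking, primary cues before secondary ones.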