AI & ML New Capability

Introduces HopChain, a framework for synthesizing multi-hop vision-language reasoning data that yields generalizable gains across 20+ diverse benchmarks.

arXiv · March 19, 2026 · 2603.17024

Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin

The Takeaway

HopChain addresses the scarcity of complex, visually grounded reasoning data for Vision-Language Models (VLMs) by programmatically generating chains of visual questions in which each question logically depends on the answer to the one before it. Training on this data enables VLMs to improve significantly on STEM, document understanding, and video tasks without task-specific training. A rough sketch of the chained-question idea follows.
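The excerpt does not spell out HopChain's actual generation pipeline, so the following is only a minimal Python sketch of the idea stated above: a chain of visual questions where each hop is instantiated with the answer to the hop it depends on. The `Hop` dataclass, `compose_chain` function, and example questions are hypothetical illustrations, not code from the paper.

```python
# Illustrative sketch only: models "logically dependent chains of visual
# questions" with hypothetical names; the real HopChain pipeline is not
# described in this excerpt.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hop:
    """One step in a multi-hop visual question chain."""
    question: str                      # question posed about the image
    answer: str                        # ground-truth answer for this hop
    depends_on: Optional[int] = None   # index of the hop whose answer this question uses


def compose_chain(hops: List[Hop]) -> str:
    """Fold dependent hops into one multi-step question sequence.

    A hop's question may contain a '{prev}' placeholder that is filled with
    the answer of the hop it depends on, so the final step can only be
    answered by reasoning through every intermediate step.
    """
    resolved: List[str] = []
    for i, hop in enumerate(hops):
        if hop.depends_on is not None:
            q = hop.question.format(prev=hops[hop.depends_on].answer)
        else:
            q = hop.question
        resolved.append(f"Step {i + 1}: {q}")
    return "\n".join(resolved)


if __name__ == "__main__":
    # Toy two-hop chain grounded in a single image.
    chain = [
        Hop(question="What object is the person in the red jacket holding?",
            answer="a trophy"),
        Hop(question="What text is engraved on {prev}?",
            answer="Regional Champions 2024",
            depends_on=0),
    ]
    print(compose_chain(chain))
```

Making each hop's dependency explicit is what forces the answer to rest on visual evidence at every intermediate step, which is the property the paper argues is missing from most existing RLVR training data.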

From the abstract

VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain…