AI & ML · New Capability

Introduces Action Applicability Policy Optimization to train MLLMs to strategically construct and update visual aids to solve geometry problems.

arXiv · March 20, 2026 · 2603.18662

Haokun Zhao, Wanshi Xu, Haidong Yuan, Songjun Cao, Long Ma, Yanghua Xiao

The Takeaway

Rather than running passive inference on static diagrams, models learn when and how to 'draw' auxiliary lines or update the figure in order to reduce reasoning entropy. This moves multimodal agents toward active problem-solving that mimics the 'scratchpad' techniques humans use in geometry.

From the abstract

Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, t