Motion-MLLM integrates IMU egomotion data into Video-LLMs to resolve the fundamental scale and spatial-reasoning ambiguities of purely visual models.
arXiv · March 19, 2026 · 2603.17980
The Takeaway
Purely visual models struggle to recover absolute physical size and distance from monocular video. By grounding video frames in physical IMU trajectories (available on most mobile and robotic devices), Motion-MLLM enables precise 3D spatial reasoning at 1.6x better cost-effectiveness than explicit 3D representations.
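To make the scale-grounding idea concrete, here is a minimal sketch (not the paper's method) of how a metric IMU displacement can pin down the unknown scale of a monocular, up-to-scale camera translation. The function names and the naive double integration are illustrative assumptions; practical systems use IMU preintegration with bias and gravity handling.

```python
import numpy as np

def imu_displacement(accel_world, dt):
    """Double-integrate gravity-compensated acceleration (world frame, m/s^2)
    over a short window to get a metric displacement. Illustrative only:
    a real pipeline would use IMU preintegration with bias estimation."""
    vel = np.cumsum(accel_world * dt, axis=0)   # velocity, m/s
    pos = np.cumsum(vel * dt, axis=0)           # position, m
    return pos[-1] - pos[0]                     # net metric displacement, m

def resolve_scale(visual_translation_up_to_scale, accel_world, dt):
    """Scale factor mapping the up-to-scale visual translation between two
    frames onto the metric IMU displacement over the same time interval."""
    metric = np.linalg.norm(imu_displacement(accel_world, dt))
    relative = np.linalg.norm(visual_translation_up_to_scale)
    return metric / max(relative, 1e-9)
```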
From the abstract
Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird's-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel […]
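The excerpt cuts off before the method details, so the following is only a hedged sketch of one plausible way an egomotion modality could be fed to a Video-LLM: a small encoder (a GRU here, purely an assumption) maps the IMU stream to a few tokens in the LLM embedding space, which are then concatenated with the video tokens. All module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class EgomotionAdapter(nn.Module):
    """Hypothetical adapter: encodes a window of 6-DoF IMU samples
    (accelerometer + gyroscope) into a small number of tokens in the
    LLM embedding space, to sit alongside the video tokens."""
    def __init__(self, imu_dim=6, hidden=256, llm_dim=4096, n_tokens=8):
        super().__init__()
        self.encoder = nn.GRU(imu_dim, hidden, batch_first=True)
        self.to_tokens = nn.Linear(hidden, n_tokens * llm_dim)
        self.n_tokens, self.llm_dim = n_tokens, llm_dim

    def forward(self, imu_seq):                  # imu_seq: (B, T, 6)
        _, h = self.encoder(imu_seq)             # h: (1, B, hidden)
        tokens = self.to_tokens(h[-1])           # (B, n_tokens * llm_dim)
        return tokens.view(-1, self.n_tokens, self.llm_dim)

# Usage sketch: egomotion tokens are concatenated with video tokens
# before the LLM, e.g.
#   llm_input = torch.cat([video_tokens, adapter(imu_seq)], dim=1)
```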