AI & ML New Capability

Proposes a world model that jointly generates appearance and binocular geometry using an epipolar-aware attention mechanism.

arXiv · March 19, 2026 · 2603.17375

Yang-Tian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao, Yuewen Ma, Xiaojuan Qi

The Takeaway

Enables end-to-end stereo video generation (for VR and robotics) 3x faster than current monocular-to-depth pipelines. It grounds metric scale directly in the RGB modality, simplifying the stack for embodied AI.
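Why binocular output carries metric scale: for a rectified stereo pair, depth follows from disparity through the standard relation depth = focal length × baseline / disparity. The sketch below illustrates that textbook relation only. It is not taken from the paper, and the function name, focal length, baseline, and disparity value are all illustrative assumptions.

```python
import math


def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Metric depth (metres) of a point seen by a rectified stereo pair.

    Textbook pinhole relation, not StereoWorld's own code:
        depth = focal_px * baseline_m / disparity_px
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return math.inf
    return focal_px * baseline_m / disparity_px


# Illustrative (assumed) values: 800 px focal length, 6.4 cm baseline, 32 px disparity.
depth_m = disparity_to_depth(disparity_px=32.0, focal_px=800.0, baseline_m=0.064)
print(f"depth = {depth_m:.2f} m")
```

With these assumed values the depth comes out at about 1.6 m. Because the baseline is a known physical length, the result is in metres rather than up to an unknown scale, which is the property a monocular pipeline has to recover separately.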

From the abstract

We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positi