Proposes a world model that jointly generates appearance and binocular geometry using an epipolar-aware attention mechanism.
arXiv · March 19, 2026 · 2603.17375
The Takeaway
Enables end-to-end stereo video generation (for VR and robotics) 3x faster than current monocular-to-depth pipelines. It grounds metric scale directly in the RGB modality, simplifying the stack for embodied AI.
From the abstract
We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary position embeddings …
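The abstract's first design, a camera-frame RoPE, builds on standard rotary position embeddings (RoPE), which rotate pairs of feature channels by position-dependent angles. The sketch below shows plain RoPE in NumPy; the "camera-frame" part is illustrated only loosely by offsetting the right view's token positions with a hypothetical baseline-derived shift — an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply rotary position embeddings (RoPE) to token features.

    x: (num_tokens, dim) with even dim.
    positions: (num_tokens,) scalar positions for each token.
    Each (x1_i, x2_i) channel pair is rotated by angle pos * freq_i.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

tokens = np.random.default_rng(0).normal(size=(4, 8))
# Hypothetical camera-aware usage: shift the right view's positions by a
# stereo-baseline offset so corresponding tokens get aligned rotations.
left = rope_rotate(tokens, positions=np.arange(4, dtype=float))
right = rope_rotate(tokens, positions=np.arange(4, dtype=float) + 1.5)

# RoPE is a blockwise rotation, so it preserves token norms.
print(np.allclose(np.linalg.norm(left, axis=-1),
                  np.linalg.norm(tokens, axis=-1)))  # True
```

Because the rotation angle depends only on position, attention scores between two RoPE-encoded tokens depend on their relative offset, which is what makes a shared camera frame a natural way to tie the two views together.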