AI & ML Paradigm Shift

Proposes modeling the world in the feature space of frozen geometry foundation models rather than in pixel space, reporting 5x faster depth forecasting.

arXiv · March 16, 2026 · 2603.12655

Xiangyu Sun, Shijie Wang, Fengyi Zhang, Lin Liu, Caiyan Jia, Ziying Song, Zi Huang, Yadan Luo

Why it matters

By side-stepping video generation, this world model avoids the geometric hallucinations and photometric noise common in generated frames, suggesting that 3D world models are better served by predicting states at the feature level.

From the abstract

World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight tempor
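To make the idea of forecasting in a frozen feature space concrete, here is a minimal toy sketch. It is not the VGGT-World implementation: the fixed 2x2 matrix standing in for a frozen encoder, the circular-motion "frames", the linear one-step predictor, and every function name are assumptions made purely for illustration. The only point it shows is that the encoder is never updated while a small predictor is trained on latent features alone.

```python
import math

# Hypothetical stand-in for a frozen, pretrained encoder (NOT the real VGGT):
# fixed weights that are never updated during training.
FROZEN_WEIGHTS = ((1.0, 0.5), (-0.5, 1.0))

def encode(frame):
    """Map a toy 2-value 'frame' to latent features using the frozen weights."""
    return [sum(w * x for w, x in zip(row, frame)) for row in FROZEN_WEIGHTS]

def make_frame(t, step=0.3):
    """Toy observation: a point moving around the unit circle."""
    return [math.cos(step * t), math.sin(step * t)]

def predict(A, z):
    """One-step latent forecast: z_next is approximated by A @ z."""
    return [sum(a * v for a, v in zip(row, z)) for row in A]

def latent_mse(A, latents):
    """Squared forecast error in latent space, averaged over transitions."""
    pairs = list(zip(latents[:-1], latents[1:]))
    total = 0.0
    for z_t, z_next in pairs:
        total += sum((p - q) ** 2 for p, q in zip(predict(A, z_t), z_next))
    return total / len(pairs)

def train_predictor(latents, lr=0.1, epochs=100):
    """Fit a linear predictor with plain SGD; only A is trained."""
    d = len(latents[0])
    A = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(epochs):
        for z_t, z_next in zip(latents[:-1], latents[1:]):
            err = [p - q for p, q in zip(predict(A, z_t), z_next)]
            for i in range(d):
                for j in range(d):
                    A[i][j] -= lr * err[i] * z_t[j]  # gradient of 0.5 * err^2
    return A

# Encode a short toy sequence once; the loss is computed on latents, not pixels.
latents = [encode(make_frame(t)) for t in range(31)]
identity = [[1.0, 0.0], [0.0, 1.0]]
baseline_error = latent_mse(identity, latents)  # "copy the last state" baseline
A = train_predictor(latents)
trained_error = latent_mse(A, latents)

# Forecast one step past the training sequence and compare in latent space.
forecast = predict(A, latents[-1])
held_out = encode(make_frame(31))
```

Because the training loss lives entirely in latent space, no frame is ever decoded or generated. In a real system the forecast features would presumably be passed to the foundation model's own prediction heads, which this sketch leaves out.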