V-JEPA 2.1 unlocks dense, spatially structured features in video self-supervised learning, yielding substantial gains in robotic manipulation and navigation.
arXiv · March 17, 2026 · 2603.14482
The Takeaway
By combining hierarchical self-supervision with a dense predictive loss, V-JEPA 2.1 creates representations that are grounded both spatially and temporally. The reported 20-point success-rate improvement in real-robot grasping demonstrates that these features translate directly to complex physical tasks.
From the abstract
We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically […]
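To make the first component concrete, here is a minimal sketch of a dense predictive loss in which both visible and masked token positions contribute to the training signal. This is not the paper's implementation; all names (`pred_tokens`, `target_tokens`, `mask`, the weighting scheme) are illustrative assumptions.

```python
# Minimal sketch, assuming a ViT-style token grid and an EMA target encoder.
# Not the V-JEPA 2.1 code; names and weighting are hypothetical.
import torch
import torch.nn.functional as F

def dense_predictive_loss(pred_tokens, target_tokens, mask,
                          masked_weight=1.0, visible_weight=1.0):
    """
    pred_tokens:   (B, N, D) predictor outputs for every token position
    target_tokens: (B, N, D) targets from a frozen / EMA target encoder
    mask:          (B, N) bool, True where the token was masked out of the input
    Returns a scalar loss in which per-token regression errors are averaged
    separately over masked and visible positions, then combined with weights,
    so that both groups contribute to the training signal.
    """
    per_token = F.smooth_l1_loss(pred_tokens, target_tokens,
                                 reduction="none").mean(dim=-1)      # (B, N)
    masked = mask.float()
    visible = (~mask).float()
    masked_term = (per_token * masked).sum() / masked.sum().clamp(min=1.0)
    visible_term = (per_token * visible).sum() / visible.sum().clamp(min=1.0)
    return masked_weight * masked_term + visible_weight * visible_term
```

Including the visible tokens in the objective, rather than supervising only the masked positions, is one plausible reading of how the loss encourages every spatial-temporal location in the feature map to stay predictive of its target.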