V-JEPA 2.1 unlocks dense, spatially structured features in video self-supervised learning, yielding substantial gains in robotic manipulation and navigation.
arXiv · March 17, 2026 · 2603.14482
The Takeaway
By combining hierarchical self-supervision with a dense predictive loss, V-JEPA 2.1 creates representations that are grounded both spatially and temporally. The reported 20-point success-rate improvement in real-robot grasping demonstrates that these features translate directly to complex physical tasks.
From the abstract
We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically […]
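To make the first component concrete, here is a minimal sketch of a dense predictive loss in which both visible and masked token positions contribute to the training signal. This is not the paper's implementation; all names (`pred_tokens`, `target_tokens`, `mask`, the weighting scheme) are illustrative assumptions.

```python
# Minimal sketch, assuming a ViT-style token grid and an EMA target encoder.
# Not the V-JEPA 2.1 code; names and weighting are hypothetical.
import torch
import torch.nn.functional as F

def dense_predictive_loss(pred_tokens, target_tokens, mask,
                          masked_weight=1.0, visible_weight=1.0):
    """
    pred_tokens:   (B, N, D) predictor outputs for every token position
    target_tokens: (B, N, D) targets from a frozen / EMA target encoder
    mask:          (B, N) bool, True where the token was masked out of the input
    Returns a scalar loss in which per-token regression errors are averaged
    separately over masked and visible positions, then combined with weights,
    so that both groups contribute to the training signal.
    """
    per_token = F.smooth_l1_loss(pred_tokens, target_tokens,
                                 reduction="none").mean(dim=-1)      # (B, N)
    masked = mask.float()
    visible = (~mask).float()
    masked_term = (per_token * masked).sum() / masked.sum().clamp(min=1.0)
    visible_term = (per_token * visible).sum() / visible.sum().clamp(min=1.0)
    return masked_weight * masked_term + visible_weight * visible_term
```

Including the visible tokens in the objective, rather than supervising only the masked positions, is one plausible reading of how the loss encourages every spatial-temporal location in the feature map to stay predictive of its target.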