Demonstrates that gated predictive autoencoders can match or outperform JEPA-style architectures by learning to select predictable components.
arXiv · March 17, 2026 · 2603.14131
The Takeaway
Joint-Embedding Predictive Architectures (JEPA) are currently favored because they avoid pixel reconstruction, but this paper shows that the reconstruction loss isn't the "culprit" when the model is allowed to selectively gate features. This challenges the emerging consensus that generative reconstruction is fundamentally inferior to embedding-based prediction for world models.
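To make the gating idea concrete, here is a minimal sketch of a reconstruction objective with a learned per-feature gate. This is an illustration under assumptions, not the paper's implementation: the sigmoid gate, the `gate_logits` parameter, and the penalty weight are all hypothetical choices.

```python
import torch
import torch.nn as nn

# Sketch of a gated reconstruction objective (an assumption, not the paper's
# exact model): a per-feature gate learns to down-weight unpredictable input
# components so their reconstruction error no longer dominates training.
class GatedAutoencoder(nn.Module):
    def __init__(self, dim_in: int, dim_latent: int):
        super().__init__()
        self.encoder = nn.Linear(dim_in, dim_latent)
        self.decoder = nn.Linear(dim_latent, dim_in)
        self.gate_logits = nn.Parameter(torch.zeros(dim_in))  # hypothetical name

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        x_hat = self.decoder(self.encoder(x))
        gate = torch.sigmoid(self.gate_logits)        # per-feature gate in (0, 1)
        recon = (gate * (x_hat - x) ** 2).mean()      # gated reconstruction error
        keep_open = (1.0 - gate).mean()               # discourages closing every gate
        return recon + 0.1 * keep_open                # 0.1 is an arbitrary weight

model = GatedAutoencoder(dim_in=16, dim_latent=4)
x = torch.randn(32, 16)
model.loss(x).backward()  # encoder, decoder, and gates train jointly
```

The design point is that the trivial solution (close all gates, reconstruct nothing) must be penalized; any regularizer that keeps gates open on average serves that role here.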
From the abstract
We evaluate JEPA-style predictive representation learning versus reconstruction-based autoencoders on a controlled "TV-series" linear dynamical system with known latent state and a single noise parameter. While an initial comparison suggests JEPA is markedly more robust to noise, further diagnostics show that autoencoder failures are strongly influenced by asymmetries in objectives and by bottleneck/component-selection effects (confirmed by PCA baselines). Motivated by these findings, we introduce…
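For intuition about the benchmark, here is a minimal sketch of such a controlled linear dynamical system with a single noise knob, plus a PCA baseline of the kind the abstract mentions. The exact construction (dimensions, stability normalization, isotropic observation noise) is assumed rather than taken from the paper.

```python
import numpy as np

def make_tv_series(T: int, d_latent: int = 4, d_obs: int = 16,
                   noise: float = 0.1, seed: int = 0):
    """Toy linear dynamical system in the spirit of the paper's setup
    (construction is an assumption): a known latent state z evolves
    linearly; observations mix z with noise scaled by one parameter."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(d_latent, d_latent))
    A /= 1.1 * np.abs(np.linalg.eigvals(A)).max()   # keep the dynamics stable
    C = rng.normal(size=(d_obs, d_latent))
    z = rng.normal(size=d_latent)
    xs, zs = [], []
    for _ in range(T):
        z = A @ z
        x = C @ z + noise * rng.normal(size=d_obs)  # the single noise knob
        zs.append(z.copy())
        xs.append(x)
    return np.array(xs), np.array(zs)

# PCA baseline: with isotropic noise, the top principal components recover
# the signal subspace, mirroring the component-selection diagnostic.
X, Z = make_tv_series(T=2000, noise=0.5)
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_proj = Xc @ Vt[:4].T   # keep as many components as latent dims
```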