Enables training of monocular novel-view synthesis models using entirely unpaired, in-the-wild internet images.
March 25, 2026
Original Paper
One View Is Enough! Monocular Training for In-the-Wild Novel View Generation
arXiv · 2603.23488
The Takeaway
Novel-view synthesis (NVS) has traditionally required multi-view image pairs for supervision, and such pairs are hard to collect at scale. OVIE instead uses a monocular depth estimator as a geometric scaffold, combined with masked training, to learn 3D consistency from 30 million uncurated images. This makes ordinary internet photos usable as training data for 3D vision.
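To make the depth-guided scaffold concrete, here is a minimal sketch of the lift, transform, and project steps the abstract describes. It is not the paper's implementation. The function name `warp_with_depth`, the pinhole camera with its principal point at the image centre, the single `focal` parameter, and the nearest-pixel splatting with a z-buffer are all assumptions made for illustration. The depth map stands in for the output of a monocular depth estimator.

```python
import math


def warp_with_depth(image, depth, focal, rotation, translation):
    """Forward-warp `image` into a new camera using per-pixel depth.

    Hypothetical helper, not taken from the paper.
    image:       H x W nested list of pixel values
    depth:       H x W nested list of positive depths
    focal:       focal length in pixels (assumed pinhole, centred principal point)
    rotation:    3x3 nested list, source-to-target camera rotation
    translation: length-3 list, source-to-target camera translation
    Returns (target, valid); valid[v][u] is False where no source pixel landed.
    """
    h, w = len(image), len(image[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    target = [[None] * w for _ in range(h)]
    zbuf = [[math.inf] * w for _ in range(h)]

    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            # Lift: back-project pixel (u, v) to a 3D point in the source frame.
            p = ((u - cx) * z / focal, (v - cy) * z / focal, z)
            # Transform: express the point in the target camera frame.
            q = [
                sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
                for i in range(3)
            ]
            if q[2] <= 0:
                continue  # point is behind the target camera
            # Project: map the 3D point back to the nearest target pixel.
            ut = round(focal * q[0] / q[2] + cx)
            vt = round(focal * q[1] / q[2] + cy)
            # Z-buffer: when several points land on one pixel, keep the nearest.
            if 0 <= ut < w and 0 <= vt < h and q[2] < zbuf[vt][ut]:
                zbuf[vt][ut] = q[2]
                target[vt][ut] = image[v][u]

    # Pixels that received nothing are holes, for example disoccluded regions.
    valid = [[px is not None for px in row] for row in target]
    return target, valid


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
row = [[10, 20, 30, 40]]
flat_depth = [[1.0, 1.0, 1.0, 1.0]]
pseudo_target, valid = warp_with_depth(row, flat_depth, 1.0, identity, [1.0, 0.0, 0.0])
print(pseudo_target)  # [[None, 10, 20, 30]]
print(valid)          # [[False, True, True, True]]
```

In the toy call, moving every point one unit along x at depth 1 with a focal length of 1 pixel moves the content one pixel to the right, and the leftmost pixel becomes a hole. Holes like this are where the pseudo-target view carries no information, which is the kind of region a masked training scheme would have to account for.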
From the abstract
Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training f