A unified streaming visual backbone that performs perception, 3D reconstruction, and robotic action simultaneously from a continuous video stream.
arXiv · March 13, 2026 · 2603.12265
Why it matters
Most vision models specialize in semantics, geometry, or temporal modeling; OmniStream unifies all three using causal spatiotemporal attention and 3D-RoPE. A single frozen backbone can then handle diverse tasks in real time via a persistent KV-cache, making it well suited to interactive agents.
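To make the streaming mechanism concrete, here is a minimal sketch of single-head causal attention over a persistent KV-cache: each incoming frame's keys and values are appended to a cache, so new tokens attend to all past frames without reprocessing them. This is an illustrative toy (NumPy, no 3D-RoPE, no multi-head), not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class StreamingAttention:
    """Single-head causal attention over a growing KV-cache.

    Each call to `step` processes one frame's tokens. Keys and values
    are appended to the persistent cache, so the cost per frame scales
    with cache length rather than re-encoding the whole video.
    """

    def __init__(self, dim):
        self.dim = dim
        self.k_cache = np.empty((0, dim))
        self.v_cache = np.empty((0, dim))

    def step(self, q, k, v):
        # Append this frame's keys/values to the persistent cache.
        self.k_cache = np.concatenate([self.k_cache, k], axis=0)
        self.v_cache = np.concatenate([self.v_cache, v], axis=0)
        # Causality holds by construction: the cache only ever
        # contains tokens from the current and earlier frames.
        scores = q @ self.k_cache.T / np.sqrt(self.dim)
        return softmax(scores) @ self.v_cache

# Feed three frames of 2 tokens each; the cache grows to 6 entries.
attn = StreamingAttention(dim=4)
rng = np.random.default_rng(0)
for _ in range(3):
    tokens = rng.normal(size=(2, 4))
    out = attn.step(tokens, tokens, tokens)
```

In a full model, the same cache would persist across perception, reconstruction, and action heads, which is what lets one frozen backbone serve several tasks on a live stream.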
From the abstract
Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention…