AI & ML New Capability

A unified streaming visual backbone that performs perception, 3D reconstruction, and robotic action simultaneously from a continuous video stream.

arXiv · March 13, 2026 · 2603.12265

Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie

Why it matters

Most vision models specialize in just one of semantics, geometry, or temporal modeling; OmniStream unifies all three using causal spatiotemporal attention and 3D-RoPE. A single frozen backbone can then handle diverse tasks in real time via a persistent KV-cache, making it well suited to interactive agents.
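The streaming mechanism can be illustrated with a minimal sketch: full attention among tokens within each frame, causal attention across frames, with keys and values accumulated in a persistent cache so per-frame cost grows only with stream length. This is a generic numpy toy, not the paper's implementation; the class name, dimensions, and single-head setup are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class StreamingCausalAttention:
    """Single-head causal attention over a video stream.

    Tokens of the newest frame attend to each other and to all
    cached past tokens; the KV-cache persists across frames, so
    earlier frames are never re-encoded (hypothetical sketch).
    """
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.k_cache = np.empty((0, dim))
        self.v_cache = np.empty((0, dim))
        self.dim = dim

    def step(self, frame_tokens):
        # frame_tokens: (n_tokens, dim) for the newest frame only.
        q = frame_tokens @ self.Wq
        k = frame_tokens @ self.Wk
        v = frame_tokens @ self.Wv
        # Append this frame's keys/values; the cache is causal state.
        self.k_cache = np.vstack([self.k_cache, k])
        self.v_cache = np.vstack([self.v_cache, v])
        attn = softmax(q @ self.k_cache.T / np.sqrt(self.dim))
        return attn @ self.v_cache

# Feed a stream of 3 frames, each carrying 4 tokens of dim 8.
layer = StreamingCausalAttention(dim=8)
for t in range(3):
    out = layer.step(np.full((4, 8), float(t)))
```

After three frames the cache holds 12 key/value rows while each `step` only computed projections for 4 new tokens, which is the property that makes real-time streaming feasible.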

From the abstract

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attent