Distills a high-fidelity joint audio-visual diffusion model into a real-time streaming generator that runs at 25 FPS on a single GPU.
arXiv · March 13, 2026 · 2603.11647
Why it matters
Most multimodal generators suffer from high latency due to bidirectional attention and modality asymmetry. This framework enables real-time, synchronized audio-video generation, a prerequisite for responsive interactive AI avatars and live digital content creation.
From the abstract
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability […]
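The core mechanism the abstract describes, replacing bidirectional attention (which must see the whole clip before emitting anything) with causal attention (which can emit frame t from frames 0..t and cache past keys/values), can be shown with a minimal sketch. This is not the OmniForcing implementation; it is a generic NumPy illustration under assumed shapes and hypothetical names, meant only to show why causal masking permits streaming emission.

```python
# Minimal sketch (assumptions, not the paper's code): bidirectional vs.
# causal attention over T frames, and streaming emission with a growing
# key/value cache. All names and shapes here are illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask):
    # q: (Tq, d); k, v: (Tk, d); mask: (Tq, Tk), 0 = attend, -inf = blocked
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    return softmax(scores) @ v

T, d = 6, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

# Bidirectional (offline): every frame attends to every other frame,
# so no output is valid until all T frames exist. This is the latency
# bottleneck the abstract points to.
offline = attention(q, k, v, np.zeros((T, T)))

# Causal: frame t attends only to frames 0..t.
causal_mask = np.where(np.tril(np.ones((T, T))) == 1, 0.0, -np.inf)
full_causal = attention(q, k, v, causal_mask)

# Streaming emission: each new frame costs one attention row over the
# cached keys/values seen so far, so frames can be emitted as generated.
stream = []
for t in range(T):
    out_t = attention(q[t:t + 1], k[:t + 1], v[:t + 1], np.zeros((1, t + 1)))
    stream.append(out_t)
stream = np.concatenate(stream)

# Frame-by-frame streaming reproduces full causal attention exactly,
# while the bidirectional output generally differs.
assert np.allclose(stream, full_causal)
print("bidirectional differs from streaming:", not np.allclose(offline, stream))
```

The point of the sketch: causal masking makes per-frame cost depend only on the cached prefix, which is what lets an autoregressive student run in a streaming loop at a fixed frame rate, whereas the bidirectional teacher must process the entire sequence before producing its first frame.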