AI & ML Efficiency Breakthrough

Distills high-fidelity joint audio-visual generation into a real-time streaming model that runs at 25 FPS on a single GPU.

arXiv · March 13, 2026 · 2603.11647

Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu, Haoran Li, Ying Li, Zezhong Qian, Haoyang Huang, Nan Duan

Why it matters

Most joint audio-visual generators incur high latency because bidirectional attention lets every frame attend to the whole clip, so nothing can be emitted until the entire sequence is denoised; meanwhile, the asymmetry between the audio and video streams makes a naive switch to causal generation unstable. This framework enables real-time, synchronized audio-video generation, a prerequisite for truly responsive interactive AI avatars and live digital content creation.
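
To make the latency point concrete, here is a minimal sketch (not the paper's code) contrasting a full bidirectional attention mask with a causal one. With a causal mask, frame t depends only on frames 0..t, which is what allows a generator to emit frames as they are computed; to sustain 25 FPS, each step then has a budget of 1000 / 25 = 40 ms.

```python
import torch

def attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """Bidirectional (full) vs. causal (lower-triangular) attention mask.

    Bidirectional attention lets every frame attend to future frames,
    so no frame can be decoded until the whole clip is denoised.
    A causal mask only permits attention to past positions, enabling
    frame-by-frame streaming and KV caching.
    """
    if causal:
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

# Example: for a 4-frame clip, row t of the causal mask marks which
# frames step t may attend to -- only 0..t, never the future.
print(attention_mask(4, causal=True))
```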

From the abstract

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability […]
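
The abstract does not spell out the inference loop, but the streaming autoregressive setup it describes would look roughly like the hedged sketch below. Here `generator.step` is an illustrative placeholder, not the paper's API: it consumes a cache of past activations and returns the next synchronized video frame / audio chunk pair.

```python
import torch

@torch.no_grad()
def stream_joint_av(generator, num_frames: int, fps: float = 25.0):
    """Hypothetical streaming loop for a causal audio-visual generator.

    Assumes `generator.step(cache)` returns
    (video_frame, audio_chunk, updated_cache); this interface is an
    assumption for illustration, not OmniForcing's actual code.
    """
    frame_budget_s = 1.0 / fps  # 40 ms per step at 25 FPS: each step
    cache = None                # must finish within this to stay real-time
    for _ in range(num_frames):
        video_frame, audio_chunk, cache = generator.step(cache)
        # Because attention is causal, each synchronized pair can be
        # emitted immediately instead of waiting for the full clip.
        yield video_frame, audio_chunk
```

The design point is that the cache only ever grows with past frames, so per-step cost stays bounded and output can be consumed as it is produced, which is what distinguishes this from offline bidirectional denoising of the whole clip.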