First foundation model to unify text, image, speech, and video using native masked diffusion instead of autoregressive serialization.
April 2, 2026
Original Paper
Dynin-Omni: Omnimodal Unified Large Diffusion Language Model
arXiv · 2604.00007
The Takeaway
Moving away from the standard autoregressive 'next-token' paradigm gives omnimodal models bidirectional context and iterative refinement: the model conditions on the whole sequence at each denoising step rather than on a left-to-right prefix alone. This simplifies the architecture while matching or beating specialized expert systems across all modalities.
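To ground the "iterative refinement" claim, here is a minimal sketch of a confidence-based masked-diffusion sampler in PyTorch. The `model` is a hypothetical denoiser that returns per-position logits under full bidirectional attention; the linear commit schedule and highest-confidence unmasking rule are illustrative assumptions, not necessarily Dynin-Omni's actual sampler.

```python
import math
import torch

def masked_diffusion_sample(model, seq_len, mask_id, steps=8, device="cpu"):
    # Start from a fully masked canvas; every position holds the [MASK] token.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(tokens)            # (1, seq_len, vocab); bidirectional attention
        conf, pred = logits.softmax(-1).max(-1)
        still_masked = tokens.eq(mask_id)
        n_masked = int(still_masked.sum())
        if n_masked == 0:
            break
        # Commit enough positions per step that everything is filled by the end.
        n_unmask = math.ceil(n_masked / (steps - step))
        conf = conf.masked_fill(~still_masked, -1.0)  # never re-pick committed slots
        top = conf.view(-1).topk(n_unmask).indices
        tokens.view(-1)[top] = pred.view(-1)[top]     # keep the most confident predictions
    return tokens
```

Unlike next-token decoding, each pass scores every position in parallel, so early and late tokens constrain each other, which is what "bidirectional context" buys over a causal prefix.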
From the abstract
We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, …
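The "shared discrete token space" is the architectural move that removes external decoders. A minimal sketch of one common way to realize such a space, offset-partitioned vocabularies; the vocabulary sizes, tokenizers, and layout here are illustrative assumptions, not the paper's:

```python
# Hypothetical per-modality codebook sizes (illustrative, not from the paper).
TEXT_VOCAB, IMAGE_CODEBOOK, SPEECH_CODEBOOK = 32_000, 8_192, 4_096

IMAGE_OFFSET = TEXT_VOCAB
SPEECH_OFFSET = TEXT_VOCAB + IMAGE_CODEBOOK
MASK_ID = TEXT_VOCAB + IMAGE_CODEBOOK + SPEECH_CODEBOOK  # one shared [MASK] token

def to_shared(token_id: int, modality: str) -> int:
    """Map a modality-local token id into the shared vocabulary."""
    if modality == "text":
        return token_id
    if modality == "image":
        return IMAGE_OFFSET + token_id
    if modality == "speech":
        return SPEECH_OFFSET + token_id
    raise ValueError(f"unknown modality: {modality}")
```

With every modality embedded in one vocabulary, a single masked-diffusion objective and sampler apply uniformly across interleaved text, image, and speech tokens, which is what lets one architecture replace per-modality expert decoders.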