AI & ML Paradigm Shift

First foundation model to unify text, image, audio, and video using native masked diffusion instead of autoregressive serialization.

April 2, 2026

Original Paper

Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

Jaeik Kim, Woojin Kim, Jihwan Hong, Yejoon Lee, Sieun Hyeon, Mintaek Lim, Yunseok Han, Dogeun Kim, Hoeun Lee, Hyunggeun Kim, Jaeyoung Do

arXiv · 2604.00007

The Takeaway

Replacing the standard autoregressive 'next-token' paradigm with masked diffusion gives omnimodal models bidirectional context and iterative refinement. The result is a simpler architecture that matches or beats specialized expert systems across all modalities.
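To make the contrast concrete, here is a minimal sketch of the iterative-refinement idea behind masked-diffusion sampling: start from a fully masked sequence, let a bidirectional model score every position in parallel, commit the most confident predictions, and re-mask the rest for the next pass. The confidence-based unmasking schedule, the `predict_logits` stub, and all sizes below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 1024      # hypothetical shared vocabulary size
MASK_ID = VOCAB_SIZE   # sentinel id for the [MASK] token
SEQ_LEN = 16
NUM_STEPS = 4          # diffusion refinement steps

def predict_logits(tokens):
    """Stand-in for the bidirectional denoiser: every position attends to
    the full left *and* right context at once (unlike a causal AR model).
    Returns per-position logits over the shared vocabulary; random here,
    purely for illustration."""
    return rng.normal(size=(len(tokens), VOCAB_SIZE))

def masked_diffusion_sample(seq_len=SEQ_LEN, steps=NUM_STEPS):
    tokens = np.full(seq_len, MASK_ID)        # start fully masked
    for step in range(steps):
        logits = predict_logits(tokens)       # one parallel, bidirectional pass
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)          # greedy choice per position
        conf = probs.max(axis=-1)
        conf[tokens != MASK_ID] = np.inf      # never drop committed tokens

        # Unmask the most confident positions this step; re-mask the rest
        # so later steps can refine them against a fuller context.
        n_keep = int(seq_len * (step + 1) / steps)
        keep = np.argsort(-conf)[:n_keep]
        new_tokens = np.full(seq_len, MASK_ID)
        new_tokens[keep] = np.where(tokens[keep] != MASK_ID,
                                    tokens[keep], pred[keep])
        tokens = new_tokens
    return tokens

print(masked_diffusion_sample())
```

The key structural difference from autoregression: each step fills in many positions at once using context from both directions, and a token committed early can still inform, and be refined alongside, everything around it over successive passes.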

From the abstract

We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space…
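The phrase "shared discrete token space" is the load-bearing idea here: every modality is reduced to tokens from a single vocabulary, so one denoiser can operate on any mixture of them. A common way to build such a space (an assumption for illustration; the excerpt does not specify the paper's tokenizers or sizes) is to pack each modality's codebook into disjoint id ranges:

```python
# Hypothetical vocabulary layout: per-modality codebooks packed into one
# shared id space, so a single diffusion model can denoise mixed sequences.
TEXT_VOCAB = 32_000    # e.g. a BPE text tokenizer (size assumed)
IMAGE_CODES = 8_192    # e.g. a VQ image tokenizer (size assumed)
AUDIO_CODES = 4_096    # e.g. a neural audio codec (size assumed)

TEXT_OFFSET = 0
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_CODES
SHARED_VOCAB = AUDIO_OFFSET + AUDIO_CODES
MASK_ID = SHARED_VOCAB  # one [MASK] token shared by all modalities

def to_shared(local_ids, offset):
    """Map modality-local token ids into the shared id space."""
    return [i + offset for i in local_ids]

# A mixed sequence: a few text tokens followed by image patch codes.
sequence = (to_shared([17, 305, 9], TEXT_OFFSET)
            + to_shared([2048, 11], IMAGE_OFFSET))
print(sequence)
```

Under this framing, generation in any modality is the same operation: mask the span to be produced and run the iterative denoising loop, which is what removes the need for the external modality-specific decoders the abstract mentions.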