Proposes a unified image tokenizer that reconciles the conflicting requirements of visual understanding and generation using a residual evolution process.
arXiv · March 13, 2026 · 2603.12108
Why it matters
Instead of separate or decoupled latent spaces, it uses a cascaded residual sequence to transition from low-level pixels to high-level semantics, enabling a single model to excel at both reconstruction and comprehension.
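The paper does not spell out its exact formulation here, but the idea of a cascaded residual sequence resembles residual quantization: each stage encodes what the previous stages left unexplained, so early codes capture coarse structure and later codes add fine detail. A minimal sketch under that assumption (function names, codebook sizes, and the nearest-neighbor quantizer are all illustrative, not the paper's actual method):

```python
import numpy as np

def residual_cascade_encode(z, codebooks):
    """Quantize features z through a cascade of codebooks: each level
    quantizes the residual left over by the previous levels, so the
    cascade transitions from coarse to fine representations."""
    codes, residual = [], z.copy()
    for cb in codebooks:
        # nearest codebook entry for the current residual (L2 distance)
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        # later levels will quantize whatever this level left over
        residual = residual - cb[idx]
    return codes

def residual_cascade_decode(codes, codebooks):
    """Reconstruct by summing the quantized vectors of every level."""
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

# Toy demonstration with random features and codebooks.
rng = np.random.default_rng(0)
dim, n_tokens, levels, cb_size = 8, 4, 3, 16
codebooks = [rng.normal(size=(cb_size, dim)) for _ in range(levels)]
z = rng.normal(size=(n_tokens, dim))
codes = residual_cascade_encode(z, codebooks)
z_hat = residual_cascade_decode(codes, codebooks)
```

In such a scheme, a comprehension head could read only the early (coarse) levels while a reconstruction decoder consumes the full cascade, letting one tokenizer serve both objectives.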
From the abstract
The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually either enforce both supervision signals on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively.