Proposes a unified image tokenizer that reconciles the conflicting requirements of visual understanding and generation using a residual evolution process.
arXiv · March 13, 2026 · 2603.12108
Why it matters
Instead of separate or decoupled latent spaces, it uses a cascaded residual sequence to transition from low-level pixels to high-level semantics, enabling a single model to excel at both reconstruction and comprehension.
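The paper does not spell out its exact formulation here, but the idea of a cascaded residual sequence resembles residual quantization: each stage encodes what the previous stages left unexplained, so early codes capture coarse structure and later codes add fine detail. A minimal sketch under that assumption (function names, codebook sizes, and the nearest-neighbor quantizer are all illustrative, not the paper's actual method):

```python
import numpy as np

def residual_cascade_encode(z, codebooks):
    """Quantize features z through a cascade of codebooks: each level
    quantizes the residual left over by the previous levels, so the
    cascade transitions from coarse to fine representations."""
    codes, residual = [], z.copy()
    for cb in codebooks:
        # nearest codebook entry for the current residual (L2 distance)
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        # later levels will quantize whatever this level left over
        residual = residual - cb[idx]
    return codes

def residual_cascade_decode(codes, codebooks):
    """Reconstruct by summing the quantized vectors of every level."""
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

# Toy demonstration with random features and codebooks.
rng = np.random.default_rng(0)
dim, n_tokens, levels, cb_size = 8, 4, 3, 16
codebooks = [rng.normal(size=(cb_size, dim)) for _ in range(levels)]
z = rng.normal(size=(n_tokens, dim))
codes = residual_cascade_encode(z, codebooks)
z_hat = residual_cascade_decode(codes, codebooks)
```

In such a scheme, a comprehension head could read only the early (coarse) levels while a reconstruction decoder consumes the full cascade, letting one tokenizer serve both objectives.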
From the abstract
The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually either enforce both supervision signals on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively.