AI & ML Efficiency Breakthrough

Achieves 4x visual token compression and 80% lower training cost while unifying multimodal comprehension and generation.

arXiv · March 16, 2026 · 2603.12793

Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun

Why it matters

It introduces a 'gated detail residual' mechanism that decouples semantic tokens from high-frequency patch details. This allows unified models to maintain generation fidelity without bloating the token sequence, significantly lowering the computational barrier for high-resolution multimodal LLMs.
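The paper's exact formulation isn't reproduced here, but the idea of a gated detail residual can be sketched: a gate conditioned on the compressed semantic tokens decides how much high-frequency patch detail to reinject, so the semantic sequence stays short and stable. A minimal NumPy sketch under assumed shapes (all names, dimensions, and projections below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 16        # token dimension (illustrative)
n_tokens = 4  # compressed semantic tokens (e.g., after 4x visual compression)

# Compressed semantic tokens used for comprehension
z_sem = rng.standard_normal((n_tokens, d))
# High-frequency patch detail features, pooled to the same token grid
detail = rng.standard_normal((n_tokens, d))

# Hypothetical learned parameters: gate projection and detail projection
W_gate = rng.standard_normal((d, d)) * 0.1
W_det = rng.standard_normal((d, d)) * 0.1

# Gate in (0, 1), conditioned on semantics, modulates the detail residual
gate = sigmoid(z_sem @ W_gate)           # shape (n_tokens, d)
fused = z_sem + gate * (detail @ W_det)  # gated detail residual

print(fused.shape)  # (4, 16)
```

Because the residual is additive and gated, setting the gate toward zero recovers the plain semantic tokens for understanding, while opening it restores detail for generation, without lengthening the token sequence.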

From the abstract

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize them within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation.