Achieves 4x visual token compression and 80% lower training cost while unifying multimodal comprehension and generation.
arXiv · March 16, 2026 · 2603.12793
Why it matters
It introduces a 'gated detail residual' mechanism that decouples semantic tokens from high-frequency patch details. This allows unified models to maintain generation fidelity without bloating the token sequence, significantly lowering the computational barrier for high-resolution multimodal LLMs.
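The paper does not spell out the mechanism here, but the idea can be sketched as a learned gate that re-injects high-frequency patch detail into the compressed semantic token stream. All names, shapes, and the specific gating form below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_detail_residual(semantic, detail, W_g, W_d):
    """Sketch of a gated detail residual (hypothetical shapes/weights).

    semantic: (N, D) compressed semantic tokens used for comprehension
    detail:   (N, D) high-frequency patch details needed for generation
    A gate computed from both streams decides, per token and channel,
    how much detail to add back without lengthening the token sequence.
    """
    g = sigmoid(np.concatenate([semantic, detail], axis=-1) @ W_g)  # (N, D) in (0, 1)
    return semantic + g * (detail @ W_d)

# Toy sizes: 4 visual tokens, 8-dim features.
N, D = 4, 8
semantic = rng.normal(size=(N, D))
detail = rng.normal(size=(N, D))
W_g = rng.normal(size=(2 * D, D)) * 0.1  # gate weights
W_d = rng.normal(size=(D, D)) * 0.1      # detail projection

out = gated_detail_residual(semantic, detail, W_g, W_d)
print(out.shape)  # (4, 8): same length as the semantic stream
```

Note the output has the same token count as the semantic stream, which is the point: detail is folded in as a residual rather than as extra tokens.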
From the abstract
A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to optimize them jointly within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation.