AI & ML Efficiency Breakthrough

Reduces the token count of Stable Diffusion 3.5 by 4x for high-resolution generation with minimal fine-tuning.

March 24, 2026

Original Paper

DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment

Xin Cai, Zhiyuan You, Zhoutong Zhang, Tianfan Xue

arXiv · 2603.22125

The Takeaway

DA-VAE enables 1024x1024 generation from only a 32x32 latent grid (1,024 tokens) and makes 2048x2048 generation feasible on standard hardware. It is a plug-and-play architectural upgrade for existing latent diffusion models that drastically reduces inference cost.
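The headline 4x figure follows from simple token arithmetic. The sketch below is a back-of-the-envelope illustration, not code from the paper; it assumes the stock SD3.5 pipeline uses an 8x-downsampling VAE followed by a 2x2 patchify step in the transformer, for an effective 16x spatial reduction.

```python
def diffusion_tokens(image_size: int, spatial_reduction: int) -> int:
    """Number of latent tokens the diffusion transformer attends over,
    given the effective spatial reduction from pixels to tokens."""
    side = image_size // spatial_reduction
    return side * side

# Stock SD3.5 at 1024px (assumed 16x effective reduction): 64x64 grid.
baseline = diffusion_tokens(1024, 16)   # 4096 tokens
# DA-VAE at 1024px: 32x32 grid, per the paper's claim.
compressed = diffusion_tokens(1024, 32) # 1024 tokens

print(baseline, compressed, baseline // compressed)  # 4096 1024 4
```

Since self-attention cost grows quadratically with token count, a 4x reduction in tokens cuts attention FLOPs by roughly 16x, which is what makes 2048x2048 generation tractable on ordinary hardware.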

From the abstract

Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training harder. Existing methods address this with extra objectives such as semantic alignment or selective dropout, but usually require costly diffu…
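The "more channels per token" trade-off the abstract describes can be made concrete: a higher spatial downsampling factor shrinks the token grid, and the tokenizer compensates by widening each token's channel dimension. The channel counts below are illustrative assumptions, not values from the paper.

```python
def latent_shape(image_size: int, downsample: int, channels: int):
    """Latent tensor shape (C, H, W) produced by a tokenizer that
    downsamples spatially by `downsample` with `channels` per token."""
    side = image_size // downsample
    return (channels, side, side)

def token_count(shape):
    return shape[1] * shape[2]

# Typical low-compression VAE (assumed 8x, 16 channels): many tokens, thin tokens.
standard = latent_shape(1024, 8, 16)     # (16, 128, 128) -> 16384 tokens
# High-compression tokenizer (assumed 32x, 64 channels): few tokens, wide tokens.
high_comp = latent_shape(1024, 32, 64)   # (64, 32, 32)   -> 1024 tokens

print(token_count(standard), token_count(high_comp))  # 16384 1024
```

The catch, per the abstract, is that simply training such a wide-channel latent space for reconstruction alone tends to destroy the structure diffusion models need, which is the failure mode DA-VAE's detail alignment targets.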