Reduces the token count of Stable Diffusion 3.5 by 4x for high-resolution generation with minimal fine-tuning.
March 24, 2026
DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment
arXiv · 2603.22125
The Takeaway
DA-VAE enables 1024x1024 generation using only 32x32 latent tokens, and makes 2048x2048 generation feasible on standard hardware. It is a plug-and-play architectural upgrade for existing latent diffusion models that drastically reduces inference cost.
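The token-count arithmetic behind the 4x claim can be sketched as follows. This is an illustration under assumed pipeline factors (8x VAE downsampling and 2x2 patchification for the baseline, 16x downsampling for the high-compression tokenizer), not numbers taken from the paper:

```python
def latent_tokens(image_size: int, downsample: int, patch: int) -> int:
    """Number of transformer tokens for a square image after VAE
    downsampling and patchification (factors are assumptions)."""
    side = image_size // downsample // patch
    return side * side

# Baseline SD3.5-style pipeline at 1024x1024: 8x VAE, 2x2 patchify.
baseline = latent_tokens(1024, 8, 2)     # 64 * 64 = 4096 tokens
# A higher-compression tokenizer (16x downsampling) halves each side,
# yielding the 32x32 = 1024 token grid and a 4x reduction.
compressed = latent_tokens(1024, 16, 2)  # 32 * 32 = 1024 tokens
print(baseline, compressed, baseline // compressed)  # 4096 1024 4
```

Since diffusion-transformer attention scales quadratically in token count, a 4x token reduction cuts attention cost by roughly 16x, which is what makes 2048x2048 generation tractable.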
From the abstract
Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training harder. Existing methods address this with extra objectives such as semantic alignment or selective dropout, but usually require costly diffusion …
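The "more channels per token" tradeoff the abstract describes can be made concrete: doubling the spatial compression factor quarters the number of latent positions, so holding total latent capacity fixed requires four times as many channels per token. A minimal sketch, with illustrative channel counts that are assumptions rather than the paper's configuration:

```python
def latent_capacity(image_size: int, downsample: int, channels: int) -> int:
    """Total scalar dimensions in the latent grid for a square image
    (downsample factors and channel counts are illustrative)."""
    side = image_size // downsample
    return side * side * channels

# 8x tokenizer with 16 channels vs. 16x tokenizer with 64 channels:
# same total capacity, but each token carries a 4x higher-dimensional
# vector -- the regime where reconstruction-only training tends to
# produce latents that are harder for diffusion models to fit.
f8  = latent_capacity(1024, 8, 16)    # 128 * 128 * 16
f16 = latent_capacity(1024, 16, 64)   # 64 * 64 * 64
print(f8, f16, f8 == f16)
```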