UNITE enables single-stage joint training of the tokenizer and the diffusion model from scratch, removing the need for frozen VAEs.
March 24, 2026
Original Paper
End-to-End Training for Unified Tokenization and Latent Denoising
arXiv · 2603.22283
The Takeaway
UNITE simplifies the complex staging of Latent Diffusion Model (LDM) training. By treating tokenization and generation as the same latent inference problem, it creates a 'common latent language' that reaches SOTA FID scores without relying on pretrained encoders such as DINO.
From the abstract
Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE, an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is …
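The weight-sharing idea can be sketched as follows. This is a minimal toy illustration, not the paper's actual architecture: the module name `GenerativeEncoder`, the adapter layers, and the dimensions are all assumptions. The point it shows is that a single shared backbone can play both roles, so one training stage updates the tokenizer and the latent denoiser together.

```python
# Toy sketch of a shared-backbone "generative encoder" (hypothetical names;
# the real UNITE architecture is not reproduced here).
import torch
import torch.nn as nn

class GenerativeEncoder(nn.Module):
    """One shared backbone used in two modes:
    tokenize: image -> latent, and denoise: noisy latent -> latent."""
    def __init__(self, img_dim=64, latent_dim=16):
        super().__init__()
        # Heavy shared backbone (weight sharing between the two roles).
        self.backbone = nn.Sequential(
            nn.Linear(img_dim, 32), nn.GELU(), nn.Linear(32, latent_dim)
        )
        # Small mode-specific input adapters map either input into the
        # backbone's expected width.
        self.img_proj = nn.Linear(img_dim, img_dim)
        self.lat_proj = nn.Linear(latent_dim, img_dim)

    def tokenize(self, x):
        return self.backbone(self.img_proj(x))       # tokenizer role

    def denoise(self, z_noisy):
        return self.backbone(self.lat_proj(z_noisy)) # denoiser role

model = GenerativeEncoder()
x = torch.randn(4, 64)                               # a batch of "images"
z = model.tokenize(x)                                # latents
z_hat = model.denoise(z + 0.1 * torch.randn_like(z)) # denoised latents
# Both roles backprop into the same backbone, so a single stage jointly
# trains tokenization and latent denoising (no frozen latent space).
loss = (z_hat - z.detach()).pow(2).mean()
loss.backward()
print(z.shape, z_hat.shape)
```

Because gradients from both calls flow into `self.backbone`, there is no frozen-tokenizer stage: the latent space co-evolves with the denoiser during training, which is the "common latent language" the takeaway refers to.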