A 1D continuous image tokenizer that uses semantic masking to achieve a 64x reduction in token usage without sacrificing generation fidelity.
April 1, 2026
Original Paper
MacTok: Robust Continuous Tokenization for Image Generation
arXiv · 2603.29634
The Takeaway
By preventing posterior collapse through DINO-guided masking, MacTok allows high-quality 512x512 image generation using as few as 64 tokens. This drastically lowers the compute floor for high-fidelity generative vision models.
From the abstract
Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this regularization often leads to posterior collapse when the token count is reduced: the encoder fails to pack informative features into the compressed latent space. To address this, we introduce MacTok, a Masked Augmenting 1D Continuous Tokenizer that leverages image masking an…
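To make the failure mode concrete, here is a minimal sketch of the two ingredients the abstract names: the per-token KL term of a diagonal-Gaussian posterior against a standard normal (the regularizer that can drive collapse), and a saliency-guided token mask standing in for the paper's DINO-feature guidance. All function names, shapes, and the use of random saliency scores are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims,
    # one value per token. When this term dominates, mu -> 0 and
    # logvar -> 0 for every token: posterior collapse.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def semantic_mask(saliency, keep_ratio=0.25):
    # Keep only the most salient tokens (a stand-in for guidance from
    # DINO features); masking the rest forces the encoder to concentrate
    # information in the surviving latents instead of collapsing.
    n = saliency.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(-saliency)[:k]] = True
    return mask

rng = np.random.default_rng(0)
num_tokens, dim = 64, 16                      # 64 tokens, as in the summary
mu = 0.1 * rng.normal(size=(num_tokens, dim))
logvar = 0.1 * rng.normal(size=(num_tokens, dim))
saliency = rng.random(num_tokens)             # hypothetical saliency scores

mask = semantic_mask(saliency, keep_ratio=0.25)
kl = kl_to_standard_normal(mu, logvar)
loss_kl = kl[mask].mean()  # regularize only the visible (kept) tokens
```

The KL term is non-negative for every token, so a healthy encoder keeps it small but strictly positive; a collapsed one drives it to zero everywhere, which is the symptom masking is meant to counteract.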