Introduces the first discrete generation model capable of handling high-dimensional (768-1024 dims) representation tokens.
arXiv · March 20, 2026 · 2603.19232
The Takeaway
Current discrete generation models are limited to low-dimensional latents (8-32 dims), which discards much of the semantic richness of the original features. Generating directly on high-dimensional tokens lets models predict and generate with the same rich representations used for understanding (e.g., CLIP/LLM embeddings), without passing through a lossy bottleneck.
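To see why the low-dimensional bottleneck exists, here is a minimal illustration (not the paper's method) of nearest-neighbor vector quantization with a fixed-size codebook: for the same number of discrete codes, the per-dimension quantization error is far worse at 1024 dims than at 16 dims. The function names, codebook size `K`, and random Gaussian data are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, codebook):
    """Assign each row of x to its nearest codebook entry (a discrete token)."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, K) squared distances
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]

def per_dim_error(dim, K=512, n=8):
    """Mean per-dimension reconstruction error for random data and a random codebook."""
    x = rng.normal(size=(n, dim))
    codebook = rng.normal(size=(K, dim))
    _, x_hat = quantize(x, codebook)
    return float(((x - x_hat) ** 2).mean())

err_lo = per_dim_error(16)    # low-dimensional latent (VQ-VAE scale)
err_hi = per_dim_error(1024)  # high-dimensional representation (CLIP scale)
# With a fixed codebook, per-dimension error grows sharply with dimension,
# which is why prior discrete generators stay in low-dimensional latent spaces.
```

This is the curse of dimensionality for a flat codebook; handling 768-1024 dim tokens requires rethinking how the discrete space is structured rather than simply enlarging the codebook.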
From the abstract
Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges […]