AI & ML Efficiency Breakthrough

Distribution-Conditioned Diffusion Decoding enables high-fidelity image generation from pre-trained VLMs without expensive full-model retraining.

March 17, 2026

Original Paper

High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding

Ji Woo Hong, Hee Suk Yoon, Gwanhyeong Koo, Eunseop Yoon, SooHwan Eom, Qi Dai, Chong Luo, Chang D. Yoo

arXiv · 2603.13389

The Takeaway

Current VLMs' visual fidelity is limited by artifacts of discrete image tokenization; this method instead trains a lightweight diffusion decoder conditioned on the logits of an existing, frozen VLM to produce high-quality images. Developers can reach state-of-the-art visual quality at roughly ImageNet-1K-level training cost, avoiding expensive full-model retraining.
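The paper does not publish its architecture in this summary, so the following is only a minimal conceptual sketch, with all names and sizes hypothetical: it shows the general idea of turning a VLM's logits over discrete image tokens into a continuous conditioning vector (a probability-weighted sum of codebook embeddings, so the full distribution is kept rather than just the argmax token), which a toy denoising loop then uses as guidance in place of a learned diffusion decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: V discrete image tokens, D-dim embeddings.
V, D = 16, 8

def distribution_condition(logits, codebook):
    """Softmax the logits over the token vocabulary, then take the
    probability-weighted sum of codebook embeddings. This keeps the
    whole distribution as a continuous signal instead of committing
    to a single discrete token."""
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return p @ codebook  # shape (D,)

def denoise_step(x_t, cond, t, weight=0.1):
    """Toy denoising update: nudge the noisy latent toward the
    conditioning vector. A real diffusion decoder would use a
    learned noise-prediction network here."""
    return x_t + weight * t * (cond - x_t)

codebook = rng.normal(size=(V, D))   # stand-in token embeddings
logits = rng.normal(size=V)          # stand-in frozen-VLM logits
cond = distribution_condition(logits, codebook)

x0 = rng.normal(size=D)              # start from pure noise
x = x0.copy()
for t in (1.0, 0.8, 0.6, 0.4, 0.2): # coarse "timestep" schedule
    x = denoise_step(x, cond, t)
```

After the loop, `x` has moved strictly closer to the conditioning vector than the initial noise was, which is the only property this sketch is meant to illustrate.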

From the abstract

Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by discrete image tokenization, which poses a major challenge. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLMs to such representations requires large-scale data and training costs comparable to the original pre-training. To circumvent this limitation, we …