EntropyCache achieves up to 26x speedup for Diffusion Language Models by using decoded token entropy as a proxy for KV cache staleness.
March 20, 2026
Original Paper
EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models
arXiv · 2603.18489
The Takeaway
Diffusion-based LLMs typically require a full forward pass at every denoising step; this method enables training-free, approximate KV caching whose cache-update decision costs only 0.5% of inference time. It makes a new class of non-autoregressive models computationally competitive with standard Transformers.
From the abstract
Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding cache updates.
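The core signal is cheap to compute: take the predictive distributions of the tokens decoded at the current denoising step, measure their entropy, and refresh the cache only when the maximum entropy crosses a threshold. A minimal sketch of that decision rule, assuming a hypothetical `threshold` hyperparameter and NumPy-array probability inputs (the paper's exact thresholding scheme may differ):

```python
import numpy as np

def max_decoded_entropy(probs: np.ndarray) -> float:
    """Maximum Shannon entropy over newly decoded token distributions.

    probs: array of shape (num_newly_decoded, vocab_size), each row a
    probability distribution over the vocabulary.
    """
    eps = 1e-12  # avoid log(0)
    ent = -np.sum(probs * np.log(probs + eps), axis=-1)
    return float(ent.max())

def should_refresh_cache(probs: np.ndarray, threshold: float = 2.0) -> bool:
    """High entropy => the model is uncertain about the newly decoded
    tokens, so cached KV states are likely stale and should be recomputed.
    `threshold` here is an illustrative value, not from the paper."""
    return max_decoded_entropy(probs) > threshold
```

Because the check only touches the distributions of tokens decoded this step, its cost is constant in context length and model depth, unlike decision rules that inspect the full cache.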