Accelerates Diffusion Transformers (DiTs) by 2x using a training-free framework that selectively reduces computation in non-aesthetic image regions.
arXiv · March 16, 2026 · 2603.12575
Why it matters
Identifies that denoising is spatially non-uniform and that aesthetic tokens in the prompt concentrate cross-attention in specific image regions. By masking low-affinity regions and using step-level prediction caching, it achieves a roughly 2x speedup and improved visual quality without any fine-tuning.
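A minimal sketch of the masking-plus-caching idea, assuming PyTorch and a hypothetical `dit` interface that can run on a subset of spatial tokens (the paper's actual implementation is not shown in this excerpt): high-affinity tokens are recomputed each step, while low-affinity tokens reuse a prediction cached from an earlier full pass.

```python
import torch

def masked_denoise_step(dit, x_tokens, t, text_emb, active_mask, cache):
    """Recompute only high-affinity tokens; reuse cached predictions elsewhere.

    Assumes `dit` accepts an arbitrary subset of spatial tokens (a
    hypothetical interface, not the paper's actual API).
      x_tokens:    (B, N, D) latent tokens
      active_mask: (N,) bool, True where affinity to aesthetic tokens is high
      cache:       (B, N, D) prediction saved from an earlier full step
    """
    idx = active_mask.nonzero(as_tuple=True)[0]   # indices of active tokens
    fresh = dit(x_tokens[:, idx], t, text_emb)    # forward pass on the subset only
    out = cache.clone()
    out[:, idx] = fresh                           # scatter fresh predictions back
    return out

# Toy usage: a stand-in "model" so the sketch runs end to end.
if __name__ == "__main__":
    B, N, D = 1, 16, 8
    dit = lambda x, t, c: x * 0.9                 # placeholder for a real DiT
    x = torch.randn(B, N, D)
    text_emb = torch.randn(1, 4, D)
    active = torch.zeros(N, dtype=torch.bool)
    active[:6] = True                             # pretend these are aesthetic regions
    cache = dit(x, 0, text_emb)                   # full pass seeds the cache
    pred = masked_denoise_step(dit, x, 1, text_emb, active, cache)
```

The saving comes from running the transformer over the reduced token set; in a real DiT the attention layers would need to accept the pruned sequence, which the placeholder lambda above only stands in for.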
From the abstract
Diffusion Transformers (DiTs) are a dominant backbone for high-fidelity text-to-image generation due to strong scalability and alignment at high resolutions. However, quadratic self-attention over dense spatial tokens leads to high inference latency and limits deployment. We observe that denoising is spatially non-uniform with respect to aesthetic descriptors in the prompt. Regions associated with aesthetic tokens receive concentrated cross-attention and show larger temporal variation, while low-affinity regions change little across denoising steps.
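The observation above suggests a simple way such regions might be scored; a sketch under the assumption that per-step cross-attention maps and the indices of aesthetic prompt tokens are available (both names below are illustrative, not the paper's API):

```python
import torch

def aesthetic_affinity_mask(cross_attn, aesthetic_ids, keep_ratio=0.5):
    """Rank spatial tokens by cross-attention mass received from aesthetic
    prompt tokens; keep the top fraction as 'active' (to be recomputed).

    cross_attn:    (heads, N_spatial, N_text) attention weights from one step
    aesthetic_ids: indices of aesthetic descriptors in the prompt; how the
                   paper identifies them is not shown here, so they are
                   assumed given.
    """
    affinity = cross_attn.mean(0)[:, aesthetic_ids].sum(-1)  # (N_spatial,)
    k = max(1, int(keep_ratio * affinity.numel()))
    threshold = affinity.topk(k).values[-1]                  # k-th largest score
    return affinity >= threshold                             # True = high affinity

# Toy usage with random attention maps.
attn = torch.softmax(torch.randn(8, 16, 4), dim=-1)
mask = aesthetic_affinity_mask(attn, aesthetic_ids=[1, 3], keep_ratio=0.4)
```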