AI & ML Efficiency Breakthrough

Accelerates Diffusion Transformers (DiTs) by 2x using a training-free framework that selectively reduces computation in non-aesthetic image regions.

arXiv · March 16, 2026 · 2603.12575

Xuanhua Yin, Chuanzhi Xu, Haoxian Zhou, Boyu Wei, Weidong Cai

Why it matters

Identifies that denoising is spatially non-uniform and that aesthetic tokens in the prompt drive attention toward specific image regions. By masking low-affinity regions and caching step-level predictions for them, the method achieves up to 2x speedup and improved visual quality without any fine-tuning.
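The mechanism can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the `keep_ratio` parameter, and the cache layout are all assumptions made here for clarity.

```python
import numpy as np

def aesthetic_affinity_mask(cross_attn, aesthetic_token_ids, keep_ratio=0.5):
    """Rank spatial tokens by their cross-attention mass on aesthetic
    prompt tokens; keep the top fraction at full compute.

    cross_attn: (num_spatial_tokens, num_prompt_tokens) attention weights.
    Returns a boolean mask over spatial tokens (True = compute this step).
    """
    # Affinity = total attention a spatial token pays to aesthetic tokens.
    affinity = cross_attn[:, aesthetic_token_ids].sum(axis=1)
    k = max(1, int(keep_ratio * affinity.shape[0]))
    keep = np.zeros(affinity.shape[0], dtype=bool)
    keep[np.argsort(affinity)[-k:]] = True  # top-k by affinity
    return keep

def cached_denoise_step(x, predict_fn, mask, cache):
    """Step-level prediction caching: recompute only high-affinity tokens,
    reuse the previous step's prediction for the rest."""
    pred = (np.array(cache["pred"]) if "pred" in cache
            else np.zeros_like(x))
    pred[mask] = predict_fn(x[mask])  # fresh compute on masked-in tokens
    cache["pred"] = pred              # masked-out tokens keep cached values
    return pred
```

In a real DiT pipeline the affinity would come from the model's cross-attention maps and `predict_fn` would be the transformer's denoising forward pass; here a toy function stands in so the control flow is visible.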

From the abstract

Diffusion Transformers (DiTs) are a dominant backbone for high-fidelity text-to-image generation due to strong scalability and alignment at high resolutions. However, quadratic self-attention over dense spatial tokens leads to high inference latency and limits deployment. We observe that denoising is spatially non-uniform with respect to aesthetic descriptors in the prompt. Regions associated with aesthetic tokens receive concentrated cross-attention and show larger temporal variation, while low […]