Integrates Chain-of-Thought reasoning directly into the Diffusion Transformer denoising process to solve complex spatial and logical tasks.
arXiv · March 13, 2026 · 2603.12252
Why it matters
It brings the reasoning capabilities of models like o1/R1 into the generative vision domain, allowing diffusion models to solve structured problems (such as Maze or Sudoku generation) that were previously out of reach for single-pass generative models.
From the abstract
Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) the MLLM text encoder exhibits insufficient reasoning depth, as single-step encoding fails to activate the Chain-of-Thought process that is essential for MLLMs to provide accurate guidance on complex tasks; (ii) the guidance remains invariant during the denoising process. …
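The contrast the abstract draws can be made concrete with a toy sketch. Nothing below is the paper's actual method or API; every function name (`encode_prompt`, `reason_step`, `denoise`, `sample`) is a hypothetical stand-in. The point is only the structural difference: static guidance is computed once before sampling, while step-wise reasoning recomputes guidance from the current latent at every denoising step.

```python
def encode_prompt(prompt: str) -> list[float]:
    """Stand-in for a single-pass MLLM text encoder (static guidance)."""
    return [float(ord(c) % 7) for c in prompt[:4]]

def reason_step(prompt: str, latent: list[float], t: int) -> list[float]:
    """Stand-in for a CoT-style step that refines guidance using the
    current latent state and timestep, rather than the prompt alone."""
    drift = 0.01 * sum(latent)
    return [g + 0.1 * t + drift for g in encode_prompt(prompt)]

def denoise(latent: list[float], guidance: list[float], t: int) -> list[float]:
    """Toy denoising update: nudge the latent toward the guidance signal."""
    return [x + (g - x) / (t + 1) for x, g in zip(latent, guidance)]

def sample(prompt: str, steps: int = 4, dynamic: bool = True) -> list[float]:
    latent = [0.0] * 4
    static_guidance = encode_prompt(prompt)  # computed once, never updated
    for t in reversed(range(steps)):
        # Static mode mirrors limitation (ii): identical guidance at every step.
        guidance = reason_step(prompt, latent, t) if dynamic else static_guidance
        latent = denoise(latent, guidance, t)
    return latent
```

In this illustrative setup, `sample(prompt, dynamic=False)` reproduces the criticized paradigm (guidance fixed after one encoding pass), while `dynamic=True` shows guidance evolving with the latent across timesteps, which is the kind of interleaving the paper's title describes.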