Introduces a verifier that operates directly on the latent hidden states of Diffusion Transformers, avoiding the need for costly pixel-space decoding during inference-time scaling.
Tiny Inference-Time Scaling with Latent Verifiers
arXiv · 2603.22492 · March 25, 2026
The Takeaway
This cuts the FLOPs of verification-based generation by 51%. For practitioners using best-of-N sampling or verifier-guided generation, it enables scaling test-time compute with significantly less VRAM and time overhead.
From the abstract
Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into …
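To make the best-of-N pattern concrete, here is a minimal sketch of verifier-guided selection operating directly on latents. The `latent_verifier_score` function is a hypothetical stand-in for the paper's learned latent verifier (a toy score is used here for illustration); the key structural point is that no VAE decode to pixels, and no MLLM re-encode, happens per candidate.

```python
import random

def latent_verifier_score(latent):
    # Hypothetical stand-in for a learned verifier that scores a latent
    # directly; here just a toy deterministic score for illustration.
    return sum(latent) / len(latent)

def best_of_n(sample_latent, verifier, n=4):
    """Draw n candidate latents and return the highest-scoring one.

    Scoring happens in latent space, so the expensive
    decode-to-pixels / re-encode-for-the-verifier round trip
    is skipped for every candidate.
    """
    candidates = [sample_latent() for _ in range(n)]
    scores = [verifier(c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy latent sampler standing in for the diffusion model's output.
rng = random.Random(0)
sample = lambda: [rng.gauss(0.0, 1.0) for _ in range(8)]
best_latent, best_score = best_of_n(sample, latent_verifier_score, n=4)
```

Only the single selected latent would then be decoded to pixels, which is where the compute savings over an MLLM verifier come from.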