AI & ML · Breaks Assumption

Frozen video diffusion models already carry cues about physical plausibility in their intermediate features, which makes cost-effective physics filtering at inference time possible.

arXiv · March 17, 2026 · 2603.14294

Chujun Tang, Lei Zhong, Fangqiang Ding

The Takeaway

The paper shows that frozen Diffusion Transformer (DiT) features contain recoverable cues about physical consistency. A lightweight verifier reads those features and prunes physically implausible denoising trajectories early, improving video quality at lower inference cost than standard Best-of-K sampling.
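To make the cost argument concrete, here is a minimal sketch of early trajectory pruning. It is an illustration under stated assumptions, not the authors' implementation: `prune_while_denoising`, `toy_denoise_step`, `toy_verifier_score`, the checkpoint steps, and the keep fraction are all hypothetical, and each candidate video is stood in for by a single float.

```python
import random
from typing import Callable, List, Set, Tuple


def prune_while_denoising(
    states: List[float],
    num_steps: int,
    checkpoints: Set[int],
    keep_fraction: float,
    denoise_step: Callable[[float, int], float],
    verifier_score: Callable[[float, int], float],
) -> Tuple[List[float], int]:
    """Denoise all candidates, dropping low-scoring ones at each checkpoint.

    Returns the surviving states and the total number of denoise_step calls.
    """
    step_calls = 0
    for t in range(num_steps):
        states = [denoise_step(s, t) for s in states]
        step_calls += len(states)
        if t in checkpoints and len(states) > 1:
            # Keep only the top-scoring fraction, but never fewer than one.
            keep = max(1, int(len(states) * keep_fraction))
            ranked = sorted(states, key=lambda s: verifier_score(s, t), reverse=True)
            states = ranked[:keep]
    return states, step_calls


# Hypothetical stand-ins: a real system would run the frozen DiT here and
# score its intermediate features with a trained verifier.
def toy_denoise_step(state: float, t: int) -> float:
    return state * 0.9


def toy_verifier_score(state: float, t: int) -> float:
    return -abs(state)  # assumption: smaller magnitude means "more plausible"


rng = random.Random(0)
initial = [rng.uniform(-1.0, 1.0) for _ in range(8)]
num_steps = 20

survivors, pruned_cost = prune_while_denoising(
    initial, num_steps, {4, 9}, 0.5, toy_denoise_step, toy_verifier_score
)
# Best-of-K runs every candidate through every step before scoring.
best_of_k_cost = len(initial) * num_steps
print(len(survivors), pruned_cost, best_of_k_cost)
```

In this toy setting, halving the candidate pool at steps 4 and 9 leaves two survivors and uses 80 denoising calls, against 160 for Best-of-K with the same eight starting candidates. The saving in a real system depends on verifier overhead and on where the checkpoints sit, neither of which this sketch models.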

From the abstract

Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce prog
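
The probing idea described in the abstract can be illustrated with a toy linear probe. This is a sketch under assumptions, not the paper's setup: the features below are synthetic Gaussians standing in for frozen mid-layer DiT activations, and the probe type, dimensions, and names (`train_linear_probe`, `probe_accuracy`, `synthetic_features`) are hypothetical.

```python
import math
import random
from typing import List, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(
    features: List[List[float]], labels: List[int], epochs: int = 200, lr: float = 0.1
) -> Tuple[List[float], float]:
    """Fit logistic regression with full-batch gradient descent.

    The feature extractor is never updated; only the probe weights are learned.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(
    w: List[float], b: float, features: List[List[float]], labels: List[int]
) -> float:
    correct = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in zip(features, labels)
    )
    return correct / len(labels)


rng = random.Random(0)


def synthetic_features(mean: float, count: int, dim: int = 4) -> List[List[float]]:
    # Assumption: plausible and implausible clips differ by a mean shift.
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]


train_x = synthetic_features(1.0, 100) + synthetic_features(-1.0, 100)
train_y = [1] * 100 + [0] * 100
test_x = synthetic_features(1.0, 50) + synthetic_features(-1.0, 50)
test_y = [1] * 50 + [0] * 50

w, b = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Held-out accuracy well above chance on a probe this simple is the kind of evidence that would indicate separability in a frozen feature space. The abstract's further point, that the signal is not explained by visual quality or generator identity, would need additional controls that this sketch does not attempt.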