Reveals that 'erasing' concepts from video diffusion models only suppresses output rather than removing the underlying representations.
March 24, 2026
Original Paper
PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion Models
arXiv · 2603.21547
The Takeaway
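The PROBE protocol shows that sensitive concepts (e.g., nudity) can be reactivated in supposedly 'safe', concept-erased models through a lightweight optimization that leaves all model parameters frozen. This exposes a basic gap in current safety auditing of T2V concept erasure: confirming that a concept is absent from generated frames does not show that it has been removed from the model.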
From the abstract
Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the reactivation potential of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo
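The core idea of the abstract (freeze every weight, optimize only a small input-side object, and see whether the "erased" concept comes back) can be illustrated with a toy. The abstract is cut off before saying exactly what PROBE optimizes, so this sketch assumes a bounded perturbation to a single input embedding. Everything here is a hypothetical stand-in, not the paper's method: a linear "model" W, a rank-one weight edit playing the role of erasure, and a fixed direction c acting as the concept detector.

```python
import numpy as np

DIM = 16   # hypothetical embedding size
EPS = 0.5  # hypothetical budget: how far the probe may move from the prompt
rng = np.random.default_rng(0)

# Hypothetical concept detector: a unit direction in output space.
c = rng.standard_normal(DIM)
c /= np.linalg.norm(c)

# Hypothetical original model (linear) and the embedding of the erased prompt.
W = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
p = rng.standard_normal(DIM)

# Toy "erasure": a rank-one weight edit that zeroes the concept score for p only.
W_erased = W - np.outer(c, p) * (c @ W @ p) / (p @ p)
W_erased.flags.writeable = False  # all model parameters stay frozen

def concept_score(embedding):
    """Detector response to the erased model's output for one input embedding."""
    return float(c @ (W_erased @ embedding))

def probe(steps=200, lr=0.1):
    """Projected gradient ascent on an input perturbation; weights are never updated."""
    grad = W_erased.T @ c  # gradient of the (linear) score w.r.t. the input
    delta = np.zeros(DIM)
    for _ in range(steps):
        delta = delta + lr * grad
        norm = np.linalg.norm(delta)
        if norm > EPS:
            delta *= EPS / norm  # project back onto the budget ball
    return delta

baseline = concept_score(p)  # the output-level check: ~0, so it looks "erased"
delta = probe()
reactivated = concept_score(p + delta)
print(f"score at prompt: {baseline:.2e}, after probe: {reactivated:.3f}")
```

In this toy, checking only the score at the original prompt would report a successful erasure, while a norm-bounded perturbation with the weights untouched recovers a clearly nonzero score. That gap between output-level suppression and representational removal is what the abstract argues current evaluation overlooks.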