AI & ML · New Capability

Introduces event-gated sampling to eliminate interaction hallucinations in video generation, such as objects drifting after placement.

arXiv · March 17, 2026 · 2603.13402

Chika Maduabuchi

The Takeaway

Addresses a fundamental failure of current DiT-based video models, in which motion occurs without contact. By explicitly grounding sampling in a lightweight event head, EVD enables the state persistence and spatial accuracy that prior SOTA models lack.
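
To make the mechanism concrete, here is a minimal sketch of what a "lightweight event head" over DiT video latents could look like. It assumes a 5-D latent layout of (batch, channels, frames, height, width); the module name, layer sizes, and architecture are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: one way a lightweight event head could sit on top of DiT
# latents. All names and sizes here are assumptions for illustration.
import torch
import torch.nn as nn


class EventHead(nn.Module):
    """Predicts a per-frame, per-location gate in [0, 1] indicating
    where and when an interaction is active."""

    def __init__(self, latent_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(latent_dim, hidden_dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden_dim, 1, kernel_size=1),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, latent_dim, frames, height, width)
        # returns: (batch, 1, frames, height, width), values in [0, 1]
        return torch.sigmoid(self.net(latents))
```

The gate such a head produces answers "when and where is an interaction active," which is exactly the signal the abstract argues frame-first denoising is missing.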

From the abstract

State-of-the-art text-to-video models often look realistic frame-by-frame yet fail on simple interactions: motion starts before contact, actions are not realized, objects drift after placement, and support relations break. We argue this stems from frame-first denoising, which updates latent state everywhere at every step without an explicit notion of when and where an interaction is active. We introduce Event-Driven Video Generation (EVD), a minimal DiT-compatible framework that makes sampling […]
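
How such a gate could enter the sampling loop is sketched below: rather than committing the denoiser's update everywhere at every step, the update is blended through the event gate so that latent state persists wherever no interaction is active. The function names and the convex blending rule are assumptions chosen to illustrate event-gated sampling, not the paper's exact algorithm.

```python
# Hedged sketch of an event-gated denoising step, assuming a `denoiser` and an
# `event_head` callable like the EventHead sketched above.
import torch


@torch.no_grad()
def event_gated_step(latents, timestep, denoiser, event_head):
    # Frame-first behavior: the denoiser proposes a new latent state everywhere.
    proposed = denoiser(latents, timestep)

    # Event gate in [0, 1]: ~1 where an interaction is active, ~0 elsewhere.
    gate = event_head(latents)

    # Apply the update only where events are active; elsewhere the latent
    # (and hence object placement and support relations) persists.
    return gate * proposed + (1.0 - gate) * latents
```

Under this reading, "drift after placement" is suppressed simply because regions with no active event keep their previous latent state instead of being re-denoised at every step.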