New AI models can now tell the difference between a real smile and one that is hiding a secret grudge.
April 29, 2026
Original Paper
StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
arXiv · 2604.23198
The Takeaway
Most video AI just labels the objects and actions it sees, like "person walking" or "dog barking". The StoryTR benchmark instead requires the AI to use Theory of Mind to infer the hidden motivations and feelings of the people on screen. It tests whether a model can follow complex social dynamics, such as a character pretending to be happy in order to deceive someone else. Moving from surface-level recognition to mental state inference is a major step toward AI that can navigate human social life, and it is essential for assistants that actually understand the context of human interactions.
From the abstract
Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see "what is happening" but fail to reason about "why it matters". This semantic gap stems from the lack of Theory of Mind (ToM): the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. We introduce StoryTR, the first video moment retrieval benchmark requiring ToM reasoning, comprising 8.1k samples.
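To make the task concrete: in video moment retrieval, a model is given a natural-language query and must return a (start, end) time span in the video; predictions are typically scored by temporal intersection-over-union (tIoU) against the annotated span, with Recall@1 at a tIoU threshold as a common metric. The sketch below illustrates that standard evaluation setup; the example query and threshold are hypothetical, and StoryTR's exact protocol may differ.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) spans, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical StoryTR-style sample: the query targets a mental state,
# not a visible action.
sample = {
    "query": "the moment she fakes a smile to hide her grudge",
    "gt_span": (12.0, 22.0),   # annotated ground-truth span (seconds)
}

pred_span = (10.0, 20.0)       # a model's top-1 predicted span
iou = temporal_iou(pred_span, sample["gt_span"])
print(f"tIoU = {iou:.3f}")              # → tIoU = 0.667
print("hit @ 0.5:", iou >= 0.5)         # Recall@1 counts this as correct
```

Note the contrast with action-centric queries: the span itself is scored identically, but locating it requires inferring the character's hidden intent rather than matching visible motion.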