A diagnostic revealing that over 50% of video understanding benchmark samples can be solved without any video or temporal context.
April 1, 2026
Original Paper
Video-Oasis: Rethinking Evaluation of Video Understanding
arXiv · 2603.29616
The Takeaway
This is a major critique of the current video LLM landscape, showing that 'SOTA' models often merely exploit linguistic priors or static image features rather than genuine temporal understanding. It provides a new diagnostic suite (Video-Oasis) that forces evaluation of true spatio-temporal reasoning instead of benchmark shortcuts.
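The core diagnostic idea — checking what fraction of a benchmark a text-only baseline can solve without ever seeing the video — can be sketched as follows. This is an illustrative toy, not the paper's actual pipeline; the function names and sample data are invented for the example.

```python
def blind_solvable_fraction(samples, answer_fn):
    """Fraction of samples answered correctly from the question text alone
    (no video frames supplied). A high value signals that the benchmark
    rewards linguistic priors rather than video understanding."""
    correct = sum(answer_fn(s["question"]) == s["answer"] for s in samples)
    return correct / len(samples)

# Toy benchmark: one question is answerable from priors, one is not.
samples = [
    {"question": "Is the sky in the video blue?", "answer": "yes"},
    {"question": "What happens after the cup falls?", "answer": "it breaks"},
]

def language_prior_baseline(question):
    # A crude text-only model: guesses "yes" for yes/no questions, abstains otherwise.
    return "yes" if question.lower().startswith("is") else None

print(blind_solvable_fraction(samples, language_prior_baseline))  # 0.5
```

If even a trivial baseline like this scores well above chance, performance gains on the benchmark cannot be attributed to visual or temporal perception — which is the shortcut the paper's diagnostic is designed to expose.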
From the abstract
The inherent complexity of video understanding makes it difficult to attribute whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. While many benchmarks have emerged to assess high-level reasoning, the essential criteria that constitute video understanding remain largely overlooked. Instead of introducing yet another benchmark, we take a step back to re-examine the current landscape of video understanding. In this work, we provide Video-Oasis, a sust