AI & ML Efficiency Breakthrough

Uses a lightweight GRPO-trained policy to select the most task-relevant video frames, reducing processing time by 93% while improving video QA accuracy.

March 20, 2026

Original Paper

HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Xiangyu Bai, Bishoy Galoaa, Sarah Ostadabbas

arXiv · 2603.18850

The Takeaway

Instead of uniform sampling, HORNet learns which frames are task-relevant, reducing input data by up to 99%. This enables the processing of long-form video content at a fraction of the usual compute cost without sacrificing (and often improving) reasoning quality.
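The core idea — a small scoring policy trained with group-relative rewards to pick the frames a frozen VLM needs — can be sketched as below. This is a minimal illustration, not the paper's implementation: the frame features, the linear scorer, the simulated reward (standing in for "did the frozen VLM answer correctly given these frames"), and the independent-draw approximation of the subset log-probability are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not from the paper): each frame is a feature vector,
# and a tiny linear policy scores frames for task relevance.
num_frames, feat_dim, k = 32, 8, 4
frames = rng.normal(size=(num_frames, feat_dim))
w = np.zeros(feat_dim)  # trainable policy weights (HORNet itself is <1M params)

def select(w, frames, k, rng):
    """Sample k distinct frames from a softmax over policy scores."""
    logits = frames @ w
    p = np.exp(logits - logits.max())
    p /= p.sum()
    idx = rng.choice(len(frames), size=k, replace=False, p=p)
    # Approximation: treat the subset as k independent draws.
    logp = np.log(p[idx]).sum()
    return idx, logp

def reward(idx):
    # Stand-in for answer correctness from the frozen VLM: here, frames
    # with a large first feature are deemed task-relevant.
    return frames[idx, 0].mean()

# GRPO-style update: sample a group of selections for the same question,
# normalize rewards within the group, and take a REINFORCE step weighted
# by the group-relative advantage (no value network needed).
group_size, lr = 8, 0.5
for step in range(200):
    grads, rewards = [], []
    for _ in range(group_size):
        idx, _ = select(w, frames, k, rng)
        logits = frames @ w
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # Gradient of the (approximate) subset log-prob w.r.t. w.
        grad = frames[idx].sum(0) - k * (p[:, None] * frames).sum(0)
        grads.append(grad)
        rewards.append(reward(idx))
    r = np.array(rewards)
    adv = (r - r.mean()) / (r.std() + 1e-8)  # group-relative advantage
    w += lr * np.mean([a * g for a, g in zip(adv, grads)], axis=0)
```

The group-relative baseline is what lets the policy stay lightweight: rather than learning a critic, each sampled selection is judged only against the other selections in its group.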

From the abstract

Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces …