Modulates LLM hidden states with eye-gaze data to outperform GPT-4o by 10.5 points on streaming video understanding.
March 30, 2026
Original Paper
GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding
arXiv · 2603.25841
The Takeaway
Instead of treating gaze as a visual prompt, GazeQwen injects it into the LLM's layers through hidden-state modulation. It demonstrates that small, targeted architectural changes (~5M trainable parameters) can outperform massive closed models on human-centric video tasks.
From the abstract
Current multimodal large language models (MLLMs) cannot effectively utilize eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter-efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation. At its core is a compact gaze resampler (~1-5M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and …
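To make the idea concrete, here is a minimal PyTorch sketch of the two components the abstract names: a compact gaze resampler that cross-attends over video features biased by fixation-derived positional encodings, and a modulation step applied to the LLM's hidden states. The module names, dimensions, and the FiLM-style scale-and-shift form are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GazeResampler(nn.Module):
    """Hypothetical sketch: compress per-frame video features, biased by
    fixation positions, into a few gaze tokens via learned-query cross-attention."""

    def __init__(self, feat_dim=1024, hidden_dim=256, num_queries=8, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.proj_in = nn.Linear(feat_dim, hidden_dim)
        # 2-D fixation (x, y) -> additive positional encoding on video tokens.
        self.fix_pos = nn.Linear(2, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.out = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, video_feats, fixations):
        # video_feats: (B, T, feat_dim) frame features from a video encoder
        # fixations:   (B, T, 2) normalized gaze coordinates per frame
        kv = self.proj_in(video_feats) + self.fix_pos(fixations)
        q = self.queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        gaze_tokens, _ = self.cross_attn(q, kv, kv)
        return self.out(gaze_tokens)  # (B, num_queries, hidden_dim)


class GazeModulation(nn.Module):
    """Hypothetical FiLM-style modulation of an LLM layer's hidden states
    from pooled gaze tokens: h <- h * (1 + scale) + shift."""

    def __init__(self, llm_dim=3584, gaze_dim=256):
        super().__init__()
        self.to_scale = nn.Linear(gaze_dim, llm_dim)
        self.to_shift = nn.Linear(gaze_dim, llm_dim)

    def forward(self, hidden_states, gaze_tokens):
        # hidden_states: (B, L, llm_dim); gaze_tokens: (B, num_queries, gaze_dim)
        g = gaze_tokens.mean(dim=1, keepdim=True)  # pool gaze tokens to one vector
        return hidden_states * (1 + self.to_scale(g)) + self.to_shift(g)
```

With dimensions in this range, the resampler plus modulation layers land at a few million trainable parameters, consistent with the ~1-5M figure quoted from the abstract; the frozen MLLM backbone is untouched.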