Modulates LLM hidden states with eye-gaze data to outperform GPT-4o by 10.5 points on streaming video understanding.
March 30, 2026
Original Paper
GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding
arXiv · 2603.25841
The Takeaway
Instead of treating gaze as a visual prompt, GazeQwen injects it into the LLM's layers through hidden-state modulation. It demonstrates that small, targeted architectural changes (~5M trainable parameters) can outperform massive closed models on human-centric video tasks.
From the abstract
Current multimodal large language models (MLLMs) cannot effectively utilize eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter-efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation. At its core is a compact gaze resampler (~1-5M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and …
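To make the idea concrete, here is a minimal PyTorch sketch of the two components the abstract names: a compact gaze resampler that cross-attends over video features biased by fixation-derived positional encodings, and a modulation step applied to the LLM's hidden states. The module names, dimensions, and the FiLM-style scale-and-shift form are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GazeResampler(nn.Module):
    """Hypothetical sketch: compress per-frame video features, biased by
    fixation positions, into a few gaze tokens via learned-query cross-attention."""

    def __init__(self, feat_dim=1024, hidden_dim=256, num_queries=8, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.proj_in = nn.Linear(feat_dim, hidden_dim)
        # 2-D fixation (x, y) -> additive positional encoding on video tokens.
        self.fix_pos = nn.Linear(2, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.out = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, video_feats, fixations):
        # video_feats: (B, T, feat_dim) frame features from a video encoder
        # fixations:   (B, T, 2) normalized gaze coordinates per frame
        kv = self.proj_in(video_feats) + self.fix_pos(fixations)
        q = self.queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        gaze_tokens, _ = self.cross_attn(q, kv, kv)
        return self.out(gaze_tokens)  # (B, num_queries, hidden_dim)


class GazeModulation(nn.Module):
    """Hypothetical FiLM-style modulation of an LLM layer's hidden states
    from pooled gaze tokens: h <- h * (1 + scale) + shift."""

    def __init__(self, llm_dim=3584, gaze_dim=256):
        super().__init__()
        self.to_scale = nn.Linear(gaze_dim, llm_dim)
        self.to_shift = nn.Linear(gaze_dim, llm_dim)

    def forward(self, hidden_states, gaze_tokens):
        # hidden_states: (B, L, llm_dim); gaze_tokens: (B, num_queries, gaze_dim)
        g = gaze_tokens.mean(dim=1, keepdim=True)  # pool gaze tokens to one vector
        return hidden_states * (1 + self.to_scale(g)) + self.to_shift(g)
```

With dimensions in this range, the resampler plus modulation layers land at a few million trainable parameters, consistent with the ~1-5M figure quoted from the abstract; the frozen MLLM backbone is untouched.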