Pixelis shifts VLM reasoning from static description to a 'reasoning in pixels' agentic paradigm that learns via an executable tool grammar.
March 27, 2026
Original Paper
Pixelis: Reasoning in Pixels, from Seeing to Acting
arXiv · 2603.25091
The Takeaway
Instead of predicting text labels, the model learns to manipulate pixel space (zoom, track, segment) and improves through curiosity-driven RL. This bridges the gap between passive observation and physically grounded visual intelligence.
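The excerpt does not spell out how the curiosity bonus is computed, so here is a minimal sketch of one common formulation, in the style of intrinsic curiosity modules (Pathak et al., 2017): a learned forward model predicts the embedding of the next observation, and its prediction error becomes an intrinsic reward. The class names, dimensions, and `scale` parameter are illustrative assumptions, not Pixelis's actual design.

```python
# Hypothetical curiosity bonus (ICM-style, an assumption, not Pixelis's design):
# intrinsic reward = error of a forward model predicting the next observation
# embedding from the current embedding and the chosen pixel operation.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, emb_dim: int = 256, n_actions: int = 5):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(emb_dim + n_actions, 512),
            nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, emb: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Concatenate the observation embedding with a one-hot action code.
        one_hot = nn.functional.one_hot(action, self.n_actions).float()
        return self.net(torch.cat([emb, one_hot], dim=-1))

def curiosity_reward(model: ForwardModel, emb: torch.Tensor,
                     action: torch.Tensor, next_emb: torch.Tensor,
                     scale: float = 0.1) -> torch.Tensor:
    """Intrinsic reward: scaled prediction error on the next embedding."""
    with torch.no_grad():
        pred = model(emb, action)
        return scale * (pred - next_emb).pow(2).mean(dim=-1)
```

In training, this intrinsic term would typically be added to the task reward, pushing the agent toward pixel operations whose outcomes it cannot yet predict.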
From the abstract
Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under distribution shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential for generalizing beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from the consequences of its actions.
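To make the "compact set of executable operations" concrete, here is a minimal sketch of what such a tool grammar could look like as an API: each call names an operation and carries its arguments, and a dispatcher applies it to the current observation. Only the operation names come from the abstract; the signatures, argument names, and dispatch logic are assumptions for illustration.

```python
# Illustrative pixel-space tool grammar (interface is assumed, not from the paper).
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any

class Op(Enum):
    ZOOM_CROP = auto()          # crop a region and rescale it
    SEGMENT = auto()            # mask an object or region
    TRACK = auto()              # follow an object across video frames
    OCR = auto()                # read text from a region
    TEMPORAL_LOCALIZE = auto()  # find when an event occurs in a video

@dataclass
class ToolCall:
    op: Op
    args: dict[str, Any]        # e.g. {"box": (x0, y0, x1, y1)} for ZOOM_CROP

def execute(call: ToolCall, observation: Any) -> Any:
    """Dispatch a tool call against the current observation (stub)."""
    if call.op is Op.ZOOM_CROP:
        x0, y0, x1, y1 = call.args["box"]
        return observation[y0:y1, x0:x1]  # assumes an HxWxC pixel array
    raise NotImplementedError(f"{call.op} not implemented in this sketch")
```

Expressed this way, a reasoning trace becomes a sequence of `ToolCall`s whose outputs feed back into the model, which is what makes the behavior trainable with RL rather than a fixed captioning pass.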