AI & ML Paradigm Shift

Pixelis shifts VLM reasoning from static description to an agentic "reasoning in pixels" paradigm, in which the model learns through an executable tool grammar.

March 27, 2026

Original Paper

Pixelis: Reasoning in Pixels, from Seeing to Acting

Yunpeng Zhou

arXiv · 2603.25091

The Takeaway

Instead of predicting text labels, the model learns to manipulate pixel space (zoom, track, segment) and improves through curiosity-driven RL. This bridges the gap between passive observation and physically grounded visual intelligence.
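The "curiosity-driven RL" mentioned above can be pictured as an intrinsic reward proportional to the agent's own prediction error, so that actions with surprising consequences are reinforced. The sketch below is a generic, hypothetical formulation of such a bonus (in the style of intrinsic-curiosity methods); the summary does not specify Pixelis's actual reward, and the function name and scaling are illustrative assumptions.

```python
def intrinsic_reward(predicted_next, actual_next, scale=1.0):
    """Curiosity bonus: mean squared error between the agent's forward-model
    prediction of the next observation and the observation it actually got.
    High error = surprising outcome = larger intrinsic reward."""
    err = sum((p - a) ** 2 for p, a in zip(predicted_next, actual_next))
    return scale * err / len(predicted_next)

# A surprising transition earns more reward than a well-predicted one.
surprising = intrinsic_reward([0.0, 0.5], [0.0, 1.0])
expected = intrinsic_reward([0.0, 0.5], [0.0, 0.5])
```

The appeal of this style of reward is that it needs no curated labels: the agent's own model of "what happens if I act" supplies the training signal, which is exactly the gap between passive observation and learning from consequences that the paper targets.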

From the abstract

Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences.
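An "executable tool grammar" over a compact operation set like the one in the abstract can be sketched as a registry of pixel-space functions that a program of (operation, arguments) steps threads an image through. Everything below is illustrative, not the paper's API: the registry, the two toy operations, and the list-of-lists image stand-in are assumptions made only to show the dispatch pattern.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]  # stand-in for an image: a 2D grid of pixel values

def zoom_crop(img: Image, x0: int, y0: int, x1: int, y1: int) -> Image:
    """Crop to the sub-grid of rows [y0, y1) and columns [x0, x1)."""
    return [row[x0:x1] for row in img[y0:y1]]

def segment(img: Image, threshold: int) -> Image:
    """Binary mask: 1 where the pixel value exceeds the threshold, else 0."""
    return [[1 if px > threshold else 0 for px in row] for row in img]

# Hypothetical registry mapping grammar tokens to executable operations.
# The abstract's full set (zoom/crop, segment, track, OCR, temporal
# localization) would register more entries in the same way.
TOOLS: Dict[str, Callable] = {
    "ZOOM": zoom_crop,
    "SEGMENT": segment,
}

def execute(program: List[Tuple[str, tuple]], img: Image) -> Image:
    """Run a sequence of (op, args) steps, threading the image through."""
    for op, args in program:
        img = TOOLS[op](img, *args)
    return img

img = [[0, 10, 20], [30, 40, 50], [60, 70, 80]]
# Zoom into the bottom-left 2x2 region, then segment it.
out = execute([("ZOOM", (0, 1, 2, 3)), ("SEGMENT", (45,))], img)
```

Because each token maps to a real function, every step the agent emits is executable and its consequences are observable, which is what lets the model learn from acting rather than from static description.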