Distills SAM-style promptable visual segmentation into a 1.3M-parameter model for real-time in-sensor execution.
arXiv · March 13, 2026 · 2603.11917
Why it matters
It brings the capability of the Segment Anything Model (SAM) directly into vision sensors like the Sony IMX500 with sub-12ms latency. This democratizes high-quality segmentation for power-constrained IoT devices and smart glasses.
From the abstract
Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3M parameters and combines a dense CNN architecture with region-of-interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2.
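To make the architecture components concrete, here is a minimal sketch of the Efficient Channel Attention (ECA) idea mentioned in the abstract: channels are globally average-pooled, a small 1D convolution is slid across the channel dimension, and a sigmoid gate rescales each channel. This is a hedged, NumPy-only illustration, not PicoSAM3's actual implementation; the kernel weights (here a fixed averaging kernel) would be learned in practice, and the kernel size `k` is an assumed hyperparameter.

```python
import numpy as np

def eca(x, k=3):
    """Sketch of Efficient Channel Attention for a feature map x of shape (C, H, W).

    Steps: per-channel global average pool -> 1D conv of size k across
    channels -> sigmoid gate -> rescale channels. Kernel weights are a fixed
    averaging filter here purely for illustration; in a real ECA layer they
    are learned parameters.
    """
    C = x.shape[0]
    pooled = x.mean(axis=(1, 2))                 # (C,) channel descriptors
    pad = k // 2
    padded = np.pad(pooled, pad, mode="edge")    # pad so output stays length C
    w = np.ones(k) / k                           # illustrative fixed kernel
    conv = np.array([padded[i:i + k] @ w for i in range(C)])
    gate = 1.0 / (1.0 + np.exp(-conv))           # sigmoid attention weights
    return x * gate[:, None, None]               # rescale each channel

# Example: a (4, 2, 2) feature map keeps its shape after attention.
y = eca(np.ones((4, 2, 2)))
```

The appeal for a 1.3M-parameter in-sensor model is that the 1D cross-channel convolution adds only `k` weights per attention layer, versus the channel-squared cost of fully connected channel-attention variants such as SE blocks.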