INSID3 achieves state-of-the-art one-shot image segmentation using only frozen DINOv3 features without any training, fine-tuning, or auxiliary models.
March 31, 2026
Original Paper
INSID3: Training-Free In-Context Segmentation with DINOv3
arXiv · 2603.28480
The Takeaway
The paper demonstrates that scaled-up self-supervised features contain enough spatial and semantic structure to outperform complex supervised pipelines. This lets practitioners perform high-quality segmentation at any granularity (objects, parts, or instances) with 3x fewer parameters than current methods.
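To make the training-free idea concrete, here is a minimal sketch of similarity-based in-context segmentation: each query patch is scored against a prototype built from the masked patches of the reference image. This is an illustrative assumption, not the paper's actual pipeline, and the feature arrays are random stand-ins for real frozen DINOv3 features so the example stays self-contained.

```python
import numpy as np

# Hypothetical sketch: frozen patch features for a reference image
# (with a binary mask) and a query image. Random features stand in
# for DINOv3 outputs; no model is loaded.
rng = np.random.default_rng(0)
H = W = 16   # patch grid size (assumed)
D = 64       # feature dimension (assumed)

ref_feats = rng.standard_normal((H * W, D))
query_feats = rng.standard_normal((H * W, D))
ref_mask = np.zeros(H * W, dtype=bool)
ref_mask[: H * W // 4] = True  # pretend these patches cover the concept

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

ref_n = l2_normalize(ref_feats)
query_n = l2_normalize(query_feats)

# Prototype = mean feature over masked reference patches;
# each query patch is labeled by cosine similarity to it.
prototype = l2_normalize(ref_n[ref_mask].mean(axis=0))
scores = query_n @ prototype           # cosine similarity per patch
pred_mask = (scores > 0.0).reshape(H, W)

print(pred_mask.shape)
```

A real system would replace the random arrays with frozen DINOv3 patch features and upsample the patch-level mask to pixel resolution; the prototype-plus-cosine-similarity core shown here is the simplest training-free baseline for this task.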
From the abstract
In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual example. Existing work relies on (i) fine-tuning vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combining multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-