AI & ML New Capability

A self-supervised robotic system detects novel objects by training bespoke detectors on-the-fly from human video demonstrations, bypassing language-based prompts.

arXiv · March 16, 2026 · 2603.12751

James Akl, Jose Nicolas Avendano Arbelaez, James Barabas, Jennifer L. Barry, Kalie Ching, Noam Eshed, Jiahui Fu, Michel Hidalgo, Andrew Hoelscher, Tushar Kusnur, Andrew Messing, Zachary Nagler, Brian Okorn, Mauro Passerino, Tim J. Perkins, Eric Rosen, Ankit Shah, Tanmay Shankar, Scott Shaw

Why it matters

Current open-vocabulary detectors often require tedious prompt engineering to recognize specific instances. This 'Show, Don't Tell' approach allows a robot to automatically generate a tailored dataset and detector in minutes, significantly improving task completion in unconstrained environments.

From the abstract

How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions an