A self-supervised robotic system detects novel objects by training bespoke detectors on the fly from human video demonstrations, bypassing language-based prompts.
arXiv · March 16, 2026 · 2603.12751
Why it matters
Current open-vocabulary detectors often require tedious prompt engineering to recognize specific object instances. This 'Show, Don't Tell' approach lets a robot automatically generate a tailored dataset and detector in minutes, significantly improving task completion in unconstrained environments.
From the abstract
How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions an