DART enables real-time multi-class detection for open-vocabulary models like SAM3, achieving up to 25x speedup without any weight modifications.
arXiv · March 13, 2026 · 2603.11441
Why it matters
DART exploits the class-agnostic nature of visual backbones to share backbone computation across multiple text prompts. This allows 'detect anything' models, previously too slow for production, to run at 15+ FPS on consumer hardware while maintaining state-of-the-art accuracy.
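The sharing idea can be illustrated with a toy sketch. This is not DART's implementation and does not use SAM3's API: `backbone`, `decode`, `detect_naive`, `detect_shared`, and `ideal_speedup` are hypothetical stand-ins, and the cost figures passed to `ideal_speedup` are assumptions for illustration only. The sketch shows only that when image features do not depend on the prompt, they can be computed once and reused for every prompt.

```python
from dataclasses import dataclass


@dataclass
class CallCounter:
    """Counts how often each (hypothetical) stage is executed."""
    backbone_calls: int = 0
    decoder_calls: int = 0


def backbone(image, counter):
    """Toy stand-in for a class-agnostic visual backbone.

    The output depends only on the image, never on the text prompt.
    """
    counter.backbone_calls += 1
    return [0.5 * float(px) for px in image]


def decode(features, prompt, counter):
    """Toy stand-in for prompt-conditioned decoding."""
    counter.decoder_calls += 1
    return (prompt, sum(features) + len(prompt))


def detect_naive(image, prompts):
    """One full forward pass per prompt: the backbone runs N times."""
    counter = CallCounter()
    results = [decode(backbone(image, counter), p, counter) for p in prompts]
    return results, counter


def detect_shared(image, prompts):
    """Backbone runs once; only the decoding step is repeated per prompt."""
    counter = CallCounter()
    features = backbone(image, counter)
    results = [decode(features, p, counter) for p in prompts]
    return results, counter


def ideal_speedup(n_prompts, backbone_cost, decoder_cost):
    """Idealized speedup from sharing, under an assumed additive cost model.

    naive cost  = N * (backbone + decoder)
    shared cost = backbone + N * decoder
    """
    naive = n_prompts * (backbone_cost + decoder_cost)
    shared = backbone_cost + n_prompts * decoder_cost
    return naive / shared


if __name__ == "__main__":
    image = [1, 2, 3, 4]
    prompts = ["cat", "dog", "bicycle"]
    naive_results, naive_counter = detect_naive(image, prompts)
    shared_results, shared_counter = detect_shared(image, prompts)
    print(naive_results == shared_results)
    print(naive_counter.backbone_calls, shared_counter.backbone_calls)
```

Under this assumed cost model the benefit grows with the number of prompts and with the share of runtime spent in the backbone. The paper's reported figures may rely on further optimizations that this sketch does not model.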
From the abstract
Recent advances in vision-language modeling have produced promptable detection and segmentation systems that accept arbitrary natural language queries at inference time. Among these, SAM3 achieves state-of-the-art accuracy by combining a ViT-H/14 backbone with cross-modal transformer decoding and learned object queries. However, SAM3 processes a single text prompt per forward pass. Detecting N categories requires N independent executions, each dominated by the 439M-parameter backbone. We present