AI & ML Efficiency Breakthrough

DART enables real-time multi-class detection for open-vocabulary models like SAM3, achieving up to 25x speedup without any weight modifications.

arXiv · March 13, 2026 · 2603.11441

Mehmet Kerem Turkcan

Why it matters

DART exploits the class-agnostic nature of the visual backbone to share its computation across multiple text prompts. This lets 'detect anything' models, previously too slow for production, run at 15+ FPS on consumer hardware while maintaining state-of-the-art accuracy.
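To make the sharing idea concrete, here is a minimal Python sketch. It is not the paper's implementation: `backbone`, `decode`, `detect_naive`, `detect_shared`, and `speedup` are hypothetical stand-ins, and the cost figures passed to `speedup` are invented for illustration. The only point it shows is that when the image features do not depend on the prompt, they can be computed once and reused for every prompt.

```python
from dataclasses import dataclass


@dataclass
class CallCounter:
    """Counts how often each (toy) stage runs."""
    backbone_calls: int = 0
    decoder_calls: int = 0


def backbone(image, counter):
    # Hypothetical stand-in for the visual backbone.
    # Its output depends only on the image, never on the text prompt.
    counter.backbone_calls += 1
    return [float(pixel) for pixel in image]


def decode(features, prompt, counter):
    # Hypothetical stand-in for the prompt-conditioned decoder.
    counter.decoder_calls += 1
    return (prompt, sum(features) * len(prompt))


def detect_naive(image, prompts, counter):
    # One full forward pass per prompt: the backbone runs N times.
    return [decode(backbone(image, counter), p, counter) for p in prompts]


def detect_shared(image, prompts, counter):
    # Backbone runs once; only the decoder runs per prompt.
    features = backbone(image, counter)
    return [decode(features, p, counter) for p in prompts]


def speedup(n_prompts, backbone_cost, decoder_cost):
    # Idealized cost model (assumption): ignores batching, memory
    # traffic, and any other optimizations a real system might use.
    naive = n_prompts * (backbone_cost + decoder_cost)
    shared = backbone_cost + n_prompts * decoder_cost
    return naive / shared


image = [1, 2, 3]
prompts = ["cat", "dog", "bicycle"]

naive_counter, shared_counter = CallCounter(), CallCounter()
naive_out = detect_naive(image, prompts, naive_counter)
shared_out = detect_shared(image, prompts, shared_counter)
```

Both paths return the same detections, but the naive path calls the toy backbone three times and the shared path calls it once. Under this idealized model, with made-up costs of 9 units for the backbone and 1 unit for the decoder, `speedup(10, 9.0, 1.0)` evaluates to 100/19, roughly 5.3; the achievable gain grows with the number of prompts and with the backbone's share of the total cost.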

From the abstract

Recent advances in vision-language modeling have produced promptable detection and segmentation systems that accept arbitrary natural language queries at inference time. Among these, SAM3 achieves state-of-the-art accuracy by combining a ViT-H/14 backbone with cross-modal transformer decoding and learned object queries. However, SAM3 processes a single text prompt per forward pass. Detecting N categories requires N independent executions, each dominated by the 439M-parameter backbone. We present

