AI & ML Practical Magic

We finally have an AI that can pick one stranger's voice out of a crowded bar without ever having heard what they sound like before.

April 6, 2026

Original Paper

Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction

FNU Sidharth, Meysam Asgari, Hao-Wen Dong, Dhruv Jain

arXiv · 2604.03219

The Takeaway

Target speech extraction systems usually need a clean "enrollment" recording of a voice before they can find it in a crowd; this system instead predicts the speaker embeddings from the noisy mixture itself, so no enrollment is required. That could make real-time speech isolation more practical for hearing aids and smart assistants.

From the abstract

Personalized or target speech extraction (TSE) typically needs a clean enrollment -- hard to obtain in real-world crowded environments. We remove the essential need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the control signal for extraction. Our model maps a noisy mixture directly to a small set of candidate speaker embeddings trained to align with a strong single-speaker speaker-embedding space via permutation-invariant teacher su
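
The abstract excerpt is cut off before it describes the training objective, so the following is only a generic sketch of what "permutation-invariant" alignment between a predicted set of embeddings and a set of teacher embeddings can look like. It is not the paper's loss. The function names, the cosine-distance choice, the brute-force search over permutations, and the toy 3-dimensional vectors are all assumptions made for illustration.

```python
import itertools
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def permutation_invariant_loss(predicted, teacher):
    """Mean cosine distance under the best one-to-one assignment.

    predicted, teacher: equal-length lists of embedding vectors.
    Returns (loss, assignment), where assignment[i] is the index of the
    teacher embedding matched to predicted[i].
    Hypothetical helper: brute force is only practical for a small set.
    """
    n = len(predicted)
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n)):
        loss = sum(
            cosine_distance(predicted[i], teacher[perm[i]]) for i in range(n)
        ) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy data (assumed, not from the paper): two speakers, 3-dim embeddings.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# The model's outputs come out in the opposite order to the teacher's.
predicted = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0]]

loss, assignment = permutation_invariant_loss(predicted, teacher)
print(assignment, round(loss, 4))
```

The point of the sketch is that the loss does not depend on the order in which the candidate embeddings are emitted: the set is scored against whichever pairing with the teacher embeddings fits best.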

