We finally have an AI that can pick one stranger's voice out of a crowded bar without ever having heard what they sound like before.
April 6, 2026
Original Paper
Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction
arXiv · 2604.03219
The Takeaway
Target speech extraction usually needs a clean enrollment sample of a voice before it can find that voice in a crowd; this system skips the requirement by predicting a small set of speaker embeddings from the noisy mixture itself. That makes real-time speech isolation more practical for hearing aids and smart assistants.
From the abstract
Personalized or target speech extraction (TSE) typically needs a clean enrollment -- hard to obtain in real-world crowded environments. We remove the essential need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the control signal for extraction. Our model maps a noisy mixture directly to a small set of candidate speaker embeddings trained to align with a strong single-speaker speaker-embedding space via permutation-invariant teacher su
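To make the "permutation-invariant" part concrete, here is a minimal sketch of how a set of predicted embeddings could be scored against teacher embeddings when the order of speakers is arbitrary. It is an illustration under stated assumptions, not the paper's implementation: the cosine-distance objective, the equal number of predicted and teacher embeddings, and the names `cosine` and `pit_cosine_loss` are all hypothetical choices made for this example.

```python
from itertools import permutations
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pit_cosine_loss(predicted, teacher):
    """Permutation-invariant loss between two sets of embeddings.

    Assumption for this sketch: both sets hold the same number of vectors
    and the per-pair cost is 1 - cosine similarity. Every assignment of
    predicted slots to teacher embeddings is tried, and the cheapest one
    is returned as (mean_loss, assignment).
    """
    if len(predicted) != len(teacher):
        raise ValueError("sketch assumes equally sized embedding sets")

    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(len(teacher))):
        # perm[i] is the teacher index matched to predicted slot i.
        loss = sum(
            1.0 - cosine(predicted[i], teacher[j]) for i, j in enumerate(perm)
        ) / len(perm)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the model emits the two speakers in swapped order.
predicted = [[0.0, 1.0], [1.0, 0.0]]
teacher = [[1.0, 0.0], [0.0, 1.0]]
loss, assignment = pit_cosine_loss(predicted, teacher)
```

In the toy example the swapped ordering costs nothing, because the loss picks the assignment that pairs each predicted slot with its closest teacher embedding. The brute-force search grows factorially with the number of speakers, so it only suits the "small set" case the abstract describes.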