OmniSONAR scales cross-lingual sentence embeddings to over 1,500 languages across text, speech, code, and math in a single semantic space.
arXiv · March 18, 2026 · 2603.16606
The Takeaway
This is a massive jump from the ~100-200 languages typically supported by models like LASER or NLLB. It enables direct translation and similarity search in extremely low-resource languages and across modalities, with a 15x error reduction on certain benchmarks.
From the abstract
Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-