AI & ML · New Capability

OmniSONAR scales cross-lingual sentence embeddings to over 1,500 languages across text, speech, code, and math in a single semantic space.

arXiv · March 18, 2026 · 2603.16606

Omnilingual SONAR Team, João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne

The Takeaway

This is a massive jump from the ~100–200 languages typically covered by earlier encoders such as LASER or NLLB. It enables direct translation and similarity search in extremely low-resource languages and across modalities, with a reported 15x error reduction on some benchmarks.
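Because every language and modality lands in one shared space, cross-lingual and cross-modal similarity search reduces to nearest-neighbor lookup over embedding vectors. A minimal sketch of that retrieval step, using toy vectors in place of real OmniSONAR embeddings (the corpus entries and their labels here are hypothetical stand-ins; only the cosine-similarity search is shown):

```python
import numpy as np

def cosine_search(query: np.ndarray, corpus: np.ndarray) -> int:
    """Return the index of the corpus row most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Toy stand-ins for sentence embeddings in a single shared space;
# in a real pipeline these would come from text/speech/code encoders.
corpus = np.array([
    [0.9, 0.1, 0.0],   # e.g. English text: "The cat sleeps."
    [0.1, 0.9, 0.0],   # e.g. a code-snippet embedding
    [0.0, 0.1, 0.9],   # e.g. a speech-utterance embedding
])
query = np.array([0.85, 0.2, 0.05])  # e.g. French text: "Le chat dort."

print(cosine_search(query, corpus))  # → 0 (retrieves the English sentence)
```

The point of the shared space is exactly this: the French query retrieves its English paraphrase by plain vector similarity, with no translation step in between.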

From the abstract

Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource […]