AI & ML Open Release

OmniVoice is an open-source TTS model scaling to over 600 languages using a novel diffusion language model architecture.

April 2, 2026

Original Paper

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey

arXiv · 2604.00688

The Takeaway

OmniVoice brings high-quality, zero-shot speech synthesis to hundreds of low-resource languages that proprietary models have largely ignored. By mapping text directly to multi-codebook acoustic tokens, it avoids the two-stage (text-to-semantic-to-acoustic) pipelines common in current discrete non-autoregressive TTS systems.
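
To make "mapping text directly to multi-codebook acoustic tokens" more concrete, here is a minimal sketch of generic mask-predict decoding, one common way discrete non-autoregressive generators are run. It is not taken from the paper and is not OmniVoice's actual algorithm or API. Every name in it (`MASK`, `toy_predictor`, `iterative_unmask`) and the cosine schedule are illustrative assumptions, and the predictor returns random tokens in place of a trained network.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a token that has not been generated yet


def toy_predictor(text_ids, grid, vocab_size, rng):
    """Stand-in for a trained network (assumption, not the real model).

    Returns a (token, confidence) pair for every frame/codebook position.
    A real model would condition on text_ids and the partially filled grid.
    """
    return [
        [(rng.randrange(vocab_size), rng.random()) for _ in row]
        for row in grid
    ]


def iterative_unmask(text_ids, num_frames, num_codebooks, vocab_size,
                     steps=4, seed=0):
    """Fill a frames x codebooks token grid over a few parallel steps."""
    rng = random.Random(seed)
    grid = [[MASK] * num_codebooks for _ in range(num_frames)]
    total = num_frames * num_codebooks

    for step in range(1, steps + 1):
        masked = [
            (t, c)
            for t in range(num_frames)
            for c in range(num_codebooks)
            if grid[t][c] == MASK
        ]
        if not masked:
            break

        proposals = toy_predictor(text_ids, grid, vocab_size, rng)

        # Cosine schedule (an assumption): how many tokens stay masked
        # after this step. It reaches zero on the final step.
        keep_masked = int(total * math.cos(math.pi / 2 * step / steps))
        num_to_fill = max(1, len(masked) - keep_masked)

        # Commit the most confident proposals first; the rest stay masked
        # and are re-predicted on the next step.
        masked.sort(key=lambda pos: proposals[pos[0]][pos[1]][1], reverse=True)
        for t, c in masked[:num_to_fill]:
            grid[t][c] = proposals[t][c][0]

    return grid


if __name__ == "__main__":
    tokens = iterative_unmask(text_ids=[1, 2, 3], num_frames=8,
                              num_codebooks=4, vocab_size=1024)
    print(len(tokens), len(tokens[0]))
```

The point of the sketch is the shape of the computation: all codebooks of all frames are predicted in parallel from the text, with no intermediate semantic-token stage, and the grid is refined over a small, fixed number of steps rather than one token at a time.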

From the abstract

We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: