AI & ML Open Release

Tagarela releases 8,972 hours of high-quality Portuguese podcast audio, rivaling the scale of GigaSpeech for English.

arXiv · March 17, 2026 · 2603.15326

Frederico Santos de Oliveira, Lucas Rafael Stefanel Gris, Alef Iury Siqueira Ferreira, Augusto Seben da Rosa, Alexandre Costa Ferro Filho, Edresson Casanova, Christopher Dane Shulby, Rafael Teixeira Sousa, Diogo Fernandes Costa Silva, Anderson da Silva Soares, Arlindo Rodrigues Galvão Filho

The Takeaway

Portuguese is typically under-resourced in the speech domain; this massive release enables the training of state-of-the-art open-source ASR and TTS models for the language, closing the gap with English-centric research.

From the abstract

Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals English's GigaSpeech (10kh), enabling state-of-the-art Portuguese models. To ensure data quality, the cor

Read the original paper →

← Back to today's papers