Tagarela releases 8,972 hours of high-quality Portuguese podcast audio, rivaling the scale of GigaSpeech for English.
arXiv · March 17, 2026 · 2603.15326
The Takeaway
Portuguese has been under-resourced in speech processing because large, public, high-quality datasets are scarce. This release of nearly 9,000 hours of podcast audio enables training state-of-the-art open-source ASR and TTS models for the language, narrowing the gap with English-centric research.
From the abstract
Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals English's GigaSpeech (10kh), enabling state-of-the-art Portuguese models. To ensure data quality, the cor