Releases a curated Hindi-Sanskrit translation dataset of roughly 92K parallel sentences, focused on contemporary and spoken language.
March 26, 2026
Original Paper
Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation
arXiv · 2603.24307
The Takeaway
Sanskrit NLP has long been limited by a focus on classical and poetic texts. By covering contemporary and spoken usage, this release gives a low-resource language pair the parallel data needed to build modern translation tools, practical digital applications, and up-to-date instructional materials.
From the abstract
We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus, comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instruction materials. We benchmark this new dataset by fine-tuning three complementary models - ByT5, NLLB and IndicTrans-v2,