AI & ML Open Release

Releases a high-quality parallel dataset of about 92K sentences for Hindi-Sanskrit translation, focused on contemporary and spoken language.

March 26, 2026

Original Paper

Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

N J Karthika, Keerthana Suryanarayanan, Jahanvi Purohit, Ganesh Ramakrishnan, Jitin Singla, Anil Kumar Gourishetty

arXiv · 2603.24307

The Takeaway

Sanskrit NLP has long been limited by a focus on classical and poetic texts. This release gives anyone building modern translation tools for this low-resource language pair openly available training data drawn from contemporary usage, which supports practical digital applications and modern instructional materials.

From the abstract

We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus, comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instruction materials. We benchmark this new dataset by fine-tuning three complementary models - ByT5, NLLB and IndicTrans-v2,