Sāmayik: A Benchmark and Dataset for English-Sanskrit Translation
This work addresses the low-resource challenge in Sanskrit translation, particularly for contemporary prose, though the contribution is incremental, since it builds on existing multilingual models.
The authors tackle the problem of limited digitized content for Sanskrit translation by creating Sāmayik, a dataset of around 53,000 parallel English-Sanskrit sentences in contemporary prose. They show that models trained on it outperform models trained on classical poetry datasets, with statistically significant improvements on out-of-domain contemporary corpora.
We release Sāmayik, a dataset of around 53,000 parallel English-Sanskrit sentences written in contemporary prose. Sanskrit is a classical language that is still in use and has a rich documented heritage. However, due to the limited availability of digitized content, it remains a low-resource language. Existing Sanskrit corpora, whether monolingual or bilingual, have predominantly focused on poetry and offer limited coverage of contemporary written materials. Sāmayik is curated from a diverse range of domains, including language instruction material, textual teaching pedagogy, and online tutorials, among others. It stands out as a unique resource that specifically caters to the contemporary usage of Sanskrit, with a primary emphasis on prose writing. Translation models trained on our dataset demonstrate statistically significant improvements when translating out-of-domain contemporary corpora, outperforming models trained on older classical-era poetry datasets. Finally, we release benchmark models for translating between English and Sanskrit by adapting four multilingual pre-trained models: three have not previously been exposed to Sanskrit, while the fourth is a multilingual pre-trained translation model that covers both English and Sanskrit. The dataset and source code are available at https://github.com/ayushbits/saamayik.
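The abstract reports statistically significant improvements but does not say which test was used. As an illustration only, the sketch below shows paired bootstrap resampling, one common way to compare two translation systems on the same test set. The function name `paired_bootstrap`, the score lists, and all numbers are hypothetical and are not taken from the paper. Averaging sentence-level scores is a simplification, since corpus-level metrics such as BLEU are not plain means of sentence scores.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resamples in which system A's mean score
    is strictly higher than system B's on the same resampled sentences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Sample sentence indices with replacement; use the SAME indices
        # for both systems so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples


# Hypothetical sentence-level scores (made-up numbers, not paper results).
prose_model = [0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.44, 0.58]
poetry_model = [0.31, 0.40, 0.35, 0.45, 0.39, 0.41, 0.30, 0.43]

win_rate = paired_bootstrap(prose_model, poetry_model)
print(f"System A wins in {win_rate:.1%} of resamples")
```

Under this convention, a win rate of at least 0.95 is often read as significance at roughly the 0.05 level. In practice the scores would come from an evaluation toolkit run on real system outputs rather than from hand-written lists.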