CL · Jun 6, 2021

Itihasa: A large-scale corpus for Sanskrit to English translation

arXiv:2106.03269v3712 citations
Originality: Synthesis-oriented
AI Analysis

Itihasa provides a new benchmark for Sanskrit-to-English translation, addressing a gap in NLP resources for low-resource languages.

The authors introduce Itihasa, a large-scale dataset of 93,000 Sanskrit-to-English translation pairs drawn from two Indian epics, the Ramayana and the Mahabharata. They find that even state-of-the-art transformer models perform poorly on it, which they take as evidence of the dataset's complexity.

This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.

