IWLV-Ramayana: A Sarga-Aligned Parallel Corpus of Valmiki's Ramayana Across Indian Languages
This provides a resource for researchers in comparative literature, corpus linguistics, digital humanities, and multilingual NLP, though it is incremental as it builds on existing scholarship by adding computational structure.
The paper tackles the lack of computational resources for cross-linguistic analysis of the Ramayana by introducing the IWLV Ramayana Corpus, a structured parallel corpus aligning Valmiki's Ramayana across multiple Indian languages at the chapter level, currently including complete English and Malayalam layers with others in production.
The Ramayana is among the most influential literary traditions of South and Southeast Asia, transmitted across numerous linguistic and cultural contexts over two millennia. Despite extensive scholarship on regional Ramayana traditions, computational resources enabling systematic cross-linguistic analysis remain limited. This paper introduces the IWLV Ramayana Corpus, a structured parallel corpus aligning Valmiki's Ramayana across multiple Indian languages at the level of the sarga (chapter). The corpus currently includes complete English and Malayalam layers, with Hindi, Tamil, Kannada, and Telugu layers in active production. The dataset is distributed in structured JSONL format with explicit provenance metadata, enabling applications in comparative literature, corpus linguistics, digital humanities, and multilingual natural language processing. To our knowledge, this is the first sarga-aligned multilingual parallel corpus of the Valmiki Ramayana with explicit provenance metadata and machine-readable format.