CL · Dec 4, 2019

A Resource for Computational Experiments on Mapudungun

arXiv:1912.01772v2 · 998 citations
Originality: Synthesis-oriented
AI Analysis

This resource fills a gap for researchers and communities working on low-resource languages, though the contribution is incremental: it applies existing methods to new data.

The authors tackled the lack of computational resources for Mapudungun, an indigenous language, by creating a dataset of 142 hours of transcribed and translated conversations, and provided baseline results for NLP tasks like speech recognition and machine translation.

We present a resource for computational experiments on Mapudungun, a polysynthetic indigenous language spoken in Chile with upwards of 200 thousand speakers. We provide 142 hours of culturally significant conversations in the domain of medical treatment. The conversations are fully transcribed and translated into Spanish. The transcriptions also include annotations for code-switching and non-standard pronunciations. We also provide baseline results on three core NLP tasks: speech recognition, speech synthesis, and machine translation between Spanish and Mapudungun. We further explore other applications for which the corpus will be suitable, including the study of code-switching, historical orthography change, linguistic structure, and sociological and anthropological studies.

Code Implementations: 1 repo
