CL
Oct 10, 2017

A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

arXiv:1710.03501v31117 citations
Originality Synthesis-oriented
AI Analysis

The paper addresses the scarcity of data available to field linguists working on unwritten or endangered languages. Its contribution is incremental: it provides a new dataset rather than a novel method.

The paper tackles the lack of speech resources for low-resource languages by presenting a corpus of 5k speech utterances in Mboshi aligned with French text translations, and illustrates its use on a zero-resource task, spoken term discovery. It aims to support computational language documentation for endangered languages.

Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost-zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language phonology. We present how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.

Code Implementations: 2 repos