CL | Oct 6, 2025

How I Built ASR for Endangered Languages with a Spoken Dictionary

arXiv:2510.04832v14.91 citations | h-index: 1

Originality: Incremental advance

AI Analysis

The work lowers the barrier to entry for ASR in critically endangered languages, offering hope to communities that lack the resources to produce data in standard formats.

The paper addresses the problem of building automatic speech recognition (ASR) for endangered languages with limited data. It shows that 40 minutes of short-form pronunciation data can produce usable ASR for Manx Gaelic, with a word error rate (WER) below 50%, and replicates the approach for Cornish.

Nearly half of the world's languages are endangered. Speech technologies such as Automatic Speech Recognition (ASR) are central to revival efforts, yet most languages remain unsupported because standard pipelines expect utterance-level supervised data. Speech data often exist for endangered languages but rarely match these formats. Manx Gaelic (~2,200 speakers), for example, has had transcribed speech since 1948, yet remains unsupported by modern systems. In this paper, we explore how little data, and in what form, is needed to build ASR for critically endangered languages. We show that a short-form pronunciation resource is a viable alternative, and that 40 minutes of such data produces usable ASR for Manx (<50% WER). We replicate our approach, applying it to Cornish (~600 speakers), another critically endangered language. Results show that the barrier to entry, in quantity and form, is far lower than previously thought, giving hope to endangered language communities that cannot afford to meet the requirements arbitrarily imposed upon them.
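
For readers unfamiliar with the metric behind the 50% threshold: word error rate (WER) is the minimum number of word-level substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition only. It is not the paper's evaluation code; the function name `word_error_rate` and the example sentences are invented for illustration, and it assumes plain whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no lowercasing or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution ("sat" -> "sit") and one deletion
# (the second "the") against a six-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {example_wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, so a value under 50% means fewer than one error for every two reference words.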
