SD, CL, AS | Jul 14, 2025

Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition

arXiv:2507.10827v2 | h-index: 20
Originality: Synthesis-oriented
AI Analysis

This work addresses language documentation for the endangered SENCOTEN language in support of community revitalization efforts. Its contribution is incremental: it applies existing ASR methods to a new domain.

The paper tackles automatic speech recognition (ASR) for the SENCOTEN language, which is difficult because of limited data and complex linguistic features. It proposes an ASR-driven pipeline that combines text-to-speech augmentation with cross-lingual transfer learning, reaching a word error rate of 14.32% and a character error rate of 3.45% once minor cedilla-related errors are filtered out.
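Both metrics quoted above are edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the system output into the reference, divided by the reference length in words (WER) or characters (CER). The following is a minimal sketch of that standard definition, not the paper's evaluation code. The function names are hypothetical, and counting spaces as characters in the CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate (assumption: spaces count as characters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

With the placeholder strings `"abcd efgh"` as reference and `"abcd efgx"` as hypothesis, one wrong character makes one of two words wrong, so `wer` gives 0.5 while `cer` gives 1/9. This is the general reason WER sits well above CER when words are long.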

The SENCOTEN language, spoken on the Saanich Peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts to turn the tide of language loss caused by colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENCOTEN is challenging due to limited data and significant vocabulary variation arising from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best rescoring to maximize the use of available data. Experiments on the SENCOTEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on the test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENCOTEN language documentation.
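To illustrate how an n-gram language model can be combined with an ASR system after decoding, the sketch below rescores an n-best list: each hypothesis keeps its ASR log-probability and receives a weighted language-model log-probability on top. This is a toy illustration under stated assumptions, not the paper's implementation. The `BigramLM` class, its add-one smoothing, the `rescore_nbest` function, and the `lm_weight` value are all hypothetical, and the abstract does not specify the n-gram order or the interpolation weights used.

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one-smoothed bigram model (stand-in for a real n-gram LM)."""

    def __init__(self, sentences):
        self.unigrams = Counter()   # counts of history tokens
        self.bigrams = Counter()    # counts of (history, word) pairs
        self.vocab = {"<unk>"}      # reserve one slot for unseen words
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens)
            self.unigrams.update(tokens[:-1])
            self.bigrams.update(zip(tokens[:-1], tokens[1:]))
        self.vocab_size = len(self.vocab)

    def log_prob(self, sentence):
        """Natural-log probability of a sentence under the bigram model."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens[:-1], tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def rescore_nbest(nbest, lm, lm_weight=0.5):
    """Re-rank (text, asr_log_prob) pairs by asr_log_prob + lm_weight * lm_log_prob.

    Returns (text, combined_score) pairs, best first.
    """
    scored = [(text, asr_lp + lm_weight * lm.log_prob(text))
              for text, asr_lp in nbest]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder tokens only; no real SENCOTEN data is used here.
lm = BigramLM(["a b c", "a b d", "a b c"])
nbest = [("a c b", -1.0), ("a b c", -1.2)]
best_text, best_score = rescore_nbest(nbest, lm, lm_weight=1.0)[0]
```

In this toy case the ASR scores slightly prefer `"a c b"`, but the language model has only seen the order `a b c`, so the rescored list ranks `"a b c"` first. Shallow fusion uses the same kind of weighted sum but applies it inside beam search at each decoding step, rather than to completed hypotheses.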

