CL | Jun 2, 2025

Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak'wala Legacy Texts

arXiv:2506.01775v1 | 1 citation | h-index: 3
Originality: Synthesis-oriented
AI Analysis

This work addresses the digitization of Indigenous-language texts in support of community revitalization efforts. Its contribution is incremental: it builds on existing OCR methods and adapts them to these texts.

The paper tackles the problem of digitizing historical Kwak'wala texts that have been scanned but are not machine-readable. It applies OCR techniques to texts drawn from a collection of more than 11 scanned volumes and proposes a pipeline intended to produce high-quality transcriptions for language revitalization.

Kwak'wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak'wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak'wala text, along with post-correction models, to produce a final high-quality transcription.
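The three stages named in the abstract (language identification, masking, post-correction) can be illustrated with a toy sketch. This is not the authors' implementation: the names `OcrLine`, `ENGLISH_HINTS`, `english_ratio`, `split_by_language`, and `post_correct` are hypothetical, the language check is a crude function-word heuristic standing in for a real language-identification model, and the substitution table stands in for a trained post-correction model. It assumes an off-the-shelf OCR engine has already produced text lines with bounding boxes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical, tiny English function-word list used only for illustration.
ENGLISH_HINTS = {"the", "and", "of", "to", "in", "he", "was", "that", "it", "then"}


@dataclass
class OcrLine:
    text: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def english_ratio(text: str) -> float:
    """Fraction of tokens that look like common English function words."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in ENGLISH_HINTS for t in tokens) / len(tokens)


def split_by_language(
    lines: List[OcrLine], threshold: float = 0.25
) -> Tuple[List[OcrLine], List[OcrLine]]:
    """Return (keep, mask): lines treated as target-language text, and
    lines judged to be English whose boxes would be masked out before a
    second OCR pass. The threshold is an assumed value, not a tuned one."""
    keep: List[OcrLine] = []
    mask: List[OcrLine] = []
    for line in lines:
        if english_ratio(line.text) >= threshold:
            mask.append(line)
        else:
            keep.append(line)
    return keep, mask


def post_correct(text: str, fixes: Dict[str, str]) -> str:
    """Apply literal substitutions in dictionary order; a stand-in for a
    learned post-correction model."""
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text


# Placeholder input: the second line is a dummy string, not real Kwak'wala.
lines = [
    OcrLine("Then he went to the house of the chief.", (0, 0, 400, 20)),
    OcrLine("placeholder-a placeholder-b", (0, 24, 400, 44)),
]
keep, mask = split_by_language(lines)
```

In this sketch the boxes in `mask` would be blanked in the page image so that a later OCR pass sees only the remaining text, and `post_correct` would run on the output of that pass.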

