CL · Jan 26, 2021

A Digital Corpus of St. Lawrence Island Yupik

arXiv:2101.10496v1
Originality: Synthesis-oriented
AI Analysis

This work addresses the limited digital access that educators, community members, and researchers have to Yupik texts. Its contribution is incremental: it applies existing digitization methods to new data.

The authors address the lack of digital resources for the endangered St. Lawrence Island Yupik language by building the first publicly available digital corpus with a step-by-step digitization pipeline. The corpus enables future linguistic and NLP research and supports language education and revitalization.

St. Lawrence Island Yupik (ISO 639-3: ess) is an endangered polysynthetic language in the Inuit-Yupik language family indigenous to Alaska and Chukotka. This work presents a step-by-step pipeline for the digitization of written texts, and the first publicly available digital corpus for St. Lawrence Island Yupik, created using that pipeline. This corpus has great potential for future linguistic inquiry and research in NLP. It was also developed for use in Yupik language education and revitalization, with a primary goal of enabling easy access to Yupik texts by educators and by members of the Yupik community. A secondary goal is to support development of language technology such as spell-checkers, text-completion systems, interactive e-books, and language learning apps for use by the Yupik community.

