CL AI Dec 20, 2024

$π$-yalli: a new corpus for Nahuatl

arXiv:2412.15821v1 1 citation h-index: 4
Originality Synthesis-oriented
AI Analysis

This work addresses the lack of computational resources for Nahuatl speakers and researchers, though it is incremental in that it focuses on data collection rather than novel methods.

The paper introduces the $\pi$-YALLI corpus for Nahuatl, a low-resource language spoken by about 2 million people, to enable the development of language models and NLP tools such as grapheme unifiers and translators.

The NAHU$^2$ project is a Franco-Mexican collaboration that aims to build $π$-YALLI, a corpus suited to machine learning, which will subsequently be used to develop computational resources for the Nahuatl language. Nahuatl has few computational resources, even though it is a living language spoken by around 2 million people. We have decided to build $π$-YALLI, a corpus that will enable research on Nahuatl aimed at developing Language Models (LMs), whether dynamic or not, which will in turn enable the development of Natural Language Processing (NLP) tools such as: a) a grapheme unifier, b) a word segmenter, c) a POS grammatical analyser, d) content-based Automatic Text Summarization; and possibly, e) a translator (probabilistic or learning-based).

