CL AI Dec 20, 2024

$π$-yalli: a new corpus for Nahuatl

Juan-Manuel Torres-Moreno, Juan-José Guzmán-Landa, Graham Ranger, Martha Lorena Avendaño Garrido, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Carlos-Emiliano González-Gallardo, Elvys Linhares Pontes, Patricia Velázquez Morales, Luis-Gil Moreno Jiménez

arXiv:2412.15821v13.41 citations h-index: 4

Originality Synthesis-oriented

AI Analysis

This work addresses the lack of computational resources for Nahuatl speakers and researchers, though its contribution is incremental: it focuses on data collection rather than novel methods.

The paper introduces the $\pi$-YALLI corpus for Nahuatl, a low-resource language spoken by about 2 million people, to enable the development of language models and NLP tools such as grapheme unifiers and translators.

The NAHU$^2$ project is a Franco-Mexican collaboration aimed at building the $π$-YALLI corpus, adapted to machine learning, which will subsequently be used to develop computational resources for the Nahuatl language. Nahuatl has few computational resources, even though it is a living language spoken by around 2 million people. We have decided to build $π$-YALLI, a corpus that will enable research on Nahuatl with the goal of developing Language Models (LM), whether dynamic or not. These models will in turn enable the development of Natural Language Processing (NLP) tools such as: a) a grapheme unifier, b) a word segmenter, c) a POS grammatical analyser, d) a content-based Automatic Text Summarization system; and possibly, e) a translator (probabilistic or learning-based).
