CLApr 8

Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl

arXiv:2604.070153.5
AI Analysis

This addresses data scarcity for Nahuatl speakers and NLP applications, but it is incremental as it applies an existing technique to a new language context.

The study tackled the problem of limited training data for low-resource languages like Nahuatl by exploring data duplication, finding that incremental duplication of the Nahuatl corpus led to moderate performance improvements in semantic similarity tasks.

In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? In this type of languages (or $π$-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we will study the impact of corpora expansion in Nawatl, an agglutinative and polysynthetic $π$-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new $π$-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way. In our experiments, we will use the incremental duplication technique. The aim is to learn embeddings that are well-suited to NLP tasks. Thus, static embeddings were trained and evaluated in a sentence-level semantic similarity task. Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion. Furthermore, to our knowledge, this technique has not yet been used in the literature.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes