CL · AI · Oct 6, 2025

A First Context-Free Grammar Applied to Nawatl Corpora Augmentation

arXiv:2510.04945v1 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of corpora for low-resource languages such as Nawatl, with the aim of enabling better machine learning models. The advance is incremental: the authors themselves note that more effective grammars are needed.

The authors tackle the lack of digital resources for Nawatl by introducing a context-free grammar that generates grammatically correct artificial sentences. The expanded corpus is used to train algorithms such as FastText, which are then evaluated on sentence-level semantic tasks. Preliminary results show comparative improvements over some LLMs.

In this article we introduce a context-free grammar (CFG) for the Nawatl language. Nawatl (or Nahuatl) is an Amerindian language of the π-language type, i.e. a language with few digital resources, for which the corpora available for machine learning are virtually non-existent. The objective here is to generate a significant number of grammatically correct artificial sentences, in order to enlarge the corpora available for language model training. We want to show that a grammar enables us to expand significantly a Nawatl corpus which we call π-yalli. The corpus, thus enriched, enables us to train algorithms such as FastText and to evaluate them on sentence-level semantic tasks. Preliminary results show that, by using the grammar, comparative improvements are achieved over some LLMs. However, we observe that achieving more significant improvements requires grammars that model the Nawatl language even more effectively.
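To make the idea of grammar-based corpus augmentation concrete, here is a minimal Python sketch of how a context-free grammar can enumerate artificial sentences. It is not the authors' grammar: the rule set `GRAMMAR`, the helper `expand`, and the terminals (`det1`, `noun1`, `verb1`, and so on) are hypothetical placeholders rather than real Nawatl words, and the sketch assumes a non-recursive grammar so that the set of sentences is finite.

```python
from itertools import product

# Hypothetical toy grammar: each nonterminal maps to a list of productions.
# Terminals are placeholder tokens, NOT real Nawatl vocabulary.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["DET", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "DET": [["det1"]],
    "N": [["noun1"], ["noun2"]],
    "V": [["verb1"]],
}


def expand(symbol):
    """Return every token sequence derivable from `symbol`.

    Assumes the grammar is non-recursive; a recursive rule would
    make this enumeration loop forever.
    """
    if symbol not in GRAMMAR:
        # Anything without a rule is treated as a terminal token.
        return [[symbol]]
    results = []
    for production in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol, then combine all alternatives.
        parts = [expand(s) for s in production]
        for combo in product(*parts):
            results.append([tok for part in combo for tok in part])
    return results


# One artificial sentence per derivation of the start symbol "S".
sentences = [" ".join(tokens) for tokens in expand("S")]
```

Written out one sentence per line, a list like `sentences` could be appended to an existing corpus before training an embedding model. A realistic grammar would need recursion limits or sampling instead of exhaustive enumeration.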

