Two CFG Nahuatl for automatic corpora expansion
This work addresses the scarcity of corpora for low-resource languages such as Nawatl, enabling better embeddings for NLP applications, although the contribution is incremental because it applies existing CFG methods to a new language.
The authors tackle the lack of digital resources for Nawatl, a low-resource language, by introducing two new Context-Free Grammars that generate artificial sentences for corpus expansion. Embeddings learned on the expanded corpus improve on those learned from the original corpus alone and often outperform some LLMs on a sentence semantic similarity task.
The aim of this article is to introduce two Context-Free Grammars (CFGs) for Nawatl corpus expansion. Nawatl is an Amerindian language (a national language of Mexico) of the $\pi$-language type, i.e. a language with few digital resources. As a result, the corpora available for training Large Language Models (LLMs) are virtually non-existent, which poses a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby expand the corpora used to learn non-contextual embeddings. To this end, we introduce two new Nawatl CFGs and use them in generative mode. With these grammars, the Nawatl corpus can be expanded significantly; the expanded corpus is then used to learn embeddings, whose relevance is evaluated on a sentence semantic similarity task. The results show an improvement over those obtained with the original corpus alone, without artificial expansion, and also show that economic embeddings often perform better than some LLMs.
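To illustrate what using a CFG "in generative mode" means, the following minimal sketch enumerates every sentence derivable from a small grammar. It is not one of the two Nawatl grammars introduced here: the rules in `GRAMMAR`, the helper `expand`, and the terminal tokens (`det_a`, `noun_a`, ...) are hypothetical placeholders, not real Nawatl words, and the grammar is assumed to be non-recursive so that the enumeration terminates.

```python
from itertools import product

# Hypothetical toy grammar (an assumption for illustration only).
# Each nonterminal maps to a list of productions; any symbol that is
# not a key is treated as a terminal. Tokens are placeholders, not Nawatl.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["N"]],
    "VP": [["V"], ["V", "NP"]],
    "DET": [["det_a"]],
    "N": [["noun_a"], ["noun_b"]],
    "V": [["verb_a"]],
}


def expand(symbol, grammar):
    """Yield every terminal sequence derivable from `symbol`.

    Assumes the grammar is non-recursive; a recursive grammar would
    need a depth bound or random sampling instead of full enumeration.
    """
    if symbol not in grammar:
        # Terminal symbol: it derives only itself.
        yield [symbol]
        return
    for production in grammar[symbol]:
        # Enumerate the expansions of each right-hand-side symbol,
        # then combine one choice per symbol, in order.
        parts = [list(expand(s, grammar)) for s in production]
        for combo in product(*parts):
            yield [token for part in combo for token in part]


# The artificial "corpus": one string per derivable sentence.
sentences = [" ".join(tokens) for tokens in expand("S", GRAMMAR)]

if __name__ == "__main__":
    for sentence in sentences:
        print(sentence)
```

This toy grammar yields 20 distinct strings from only nine productions, which is the property that makes generation attractive for corpus expansion: the number of sentences grows multiplicatively with the number of alternatives per rule. A realistic grammar would add lexical and morphological rules, and a recursive one would typically be sampled rather than enumerated exhaustively.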