CL | LG | Jan 30, 2024

TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese

Nicholas Kluge Corrêa, Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira

arXiv:2401.16640v3 | 26 citations | h-index: 5 | Has Code | Mach Learn Appl

AI Analysis

This work addresses the need for accessible language models in low-resource settings such as Brazilian Portuguese, though its contribution is incremental because it applies existing methods to a new language.

The paper tackles unequal progress in large language models across languages by developing TeenyTinyLlama, a pair of compact models for Brazilian Portuguese text generation, and releasing them under an open license for community use.

Large language models (LLMs) have significantly advanced natural language processing, but their progress has yet to be equal across languages. While most LLMs are trained in high-resource languages like English, multilingual models generally underperform monolingual ones. Additionally, aspects of their multilingual foundation sometimes restrict the byproducts they produce, like computational demands and licensing regimes. In this study, we document the development of open-foundation models tailored for use in low-resource settings, their limitations, and their benefits. This is the TeenyTinyLlama pair: two compact models for Brazilian Portuguese text generation. We release them under the permissive Apache 2.0 license on GitHub and Hugging Face for community use and further development. See https://github.com/Nkluge-correa/TeenyTinyLlama
