CL, AI | Feb 20, 2024

GlórIA -- A Generative and Open Large Language Model for Portuguese

Ricardo Lopes, João Magalhães, David Semedo

arXiv:2402.12969v19.119 citations | h-index: 7 | PROPOR

Originality: Synthesis-oriented

AI Analysis

The paper addresses the limited availability of LLMs for European Portuguese speakers and researchers, though its contribution is incremental: it adapts existing methods to a new language.

The authors tackled the lack of large language models for European Portuguese by introducing GlórIA, a decoder LLM pre-trained on 35 billion tokens, which significantly outperforms existing open models in language modeling and generates coherent text.

Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful large language models (LLMs). These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware LAnguage Modeling Evaluation for Portuguese), the first Portuguese zero-shot language-modeling benchmark. Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling and that it can generate sound, knowledge-rich, and coherent PT-PT text. The model also exhibits strong potential for various downstream tasks.
