CL | AI | LG | Mar 27

AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese

Afonso Simplício, Gonçalo Vinagre, Miguel Moura Ramos, Diogo Tavares, Rafael Ferreira, Giuseppe Attanasio, Duarte M. Alves, Inês Calvo, Inês Vieira, Rui Guerra, James Furtado, Beatriz Canaverde

arXiv:2603.26511 | 72.9 | h-index: 12

AI Analysis

This work addresses the weak coverage of European Portuguese linguistic and cultural nuances in large language models, though it is incremental in that it adapts existing methods to a specific language variant.

The paper tackles the underrepresentation of European Portuguese in large language models by introducing AMALIA, a fully open model trained with more high-quality pt-PT data in the mid- and post-training stages. AMALIA matches strong baselines on translated benchmarks and substantially improves performance on new pt-PT-specific evaluations.

Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.
