CL AI LG | Aug 6, 2023

Spanish Pre-trained BERT Model and Evaluation Data

José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, Jorge Pérez

arXiv:2308.02976v1 | 26.1777 citations | h-index: 3 | Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the scarcity of resources for Spanish NLP by providing a pre-trained model and benchmarks for researchers and practitioners. The contribution is incremental, since it adapts an existing method (BERT) to a new language.

The authors tackle the lack of resources for Spanish language models by pre-training a BERT model exclusively on Spanish data and compiling Spanish-specific evaluation tasks. The fine-tuned model outperforms multilingual BERT models on most of these tasks and sets a new state of the art on some of them.

The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.
