CL, Jul 1, 2024

The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and Spain

arXiv:2407.17479v1, 3 citations, h-index: 6, Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the underrepresentation in AI systems of the 600 million Spanish speakers across LATAM, the Caribbean, and Spain, and provides essential resources for NLP advancement in these languages.

The paper tackles the lack of open datasets and leaderboards for instruction-tuning and evaluating large language models in Spanish and its regional variants, presenting the first versions of instruction and evaluation datasets created by an international open-source community.

We are 600 million Spanish speakers. We launched the #Somos600M Project because the diversity of the languages from LATAM, the Caribbean, and Spain needs to be represented in Artificial Intelligence (AI) systems. Although we make up 7.5% of the world population, there is no open dataset to instruction-tune large language models (LLMs), nor a leaderboard to evaluate and compare them. In this paper, we present how we, as an international open-source community, have created the first versions of the instruction and evaluation datasets, indispensable resources for the advancement of Natural Language Processing (NLP) in our languages.
