CL | Mar 13, 2025

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson

arXiv:2503.10267v3 | 20.917 citations | h-index: 41 | Has Code | ACL

Originality: Synthesis-oriented

AI Analysis

This work offers researchers and developers in natural language processing an expanded dataset. The contribution is incremental, since it builds on prior efforts of the HPLT project.

The authors address the challenge of building multilingual datasets for training large language models by presenting HPLT v2, a collection of high-quality monolingual and parallel corpora. The monolingual data comprises 8T tokens covering 193 languages, and the parallel data comprises 380M sentence pairs covering 51 languages. Evaluations of language models and machine translation systems trained on HPLT v2 show improved performance.

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
