CL AI | Dec 20, 2023

Language Resources for Dutch Large Language Modelling

arXiv:2312.12852v1 | 13 citations | h-index: 3

Originality: Synthesis-oriented

AI Analysis

This work provides incremental improvements for Dutch language processing by releasing models, datasets, and benchmarks to support the ecosystem.

The paper addresses the lack of Dutch-specific large language models by fine-tuning Llama 2 13B on Dutch web-crawled and synthetic datasets, and introduces a leaderboard to track model performance on generation tasks.

Despite the rapid expansion of types of large language models, there remains a notable gap in models specifically designed for the Dutch language. This gap is not only a shortage of pretrained Dutch models but also of data, benchmarks, and leaderboards. This work takes a small step toward improving the situation. First, we introduce two fine-tuned variants of the Llama 2 13B model: we first fine-tuned Llama 2 on Dutch-specific web-crawled data and subsequently refined this model further on multiple synthetic instruction and chat datasets. These datasets as well as the model weights are made available. In addition, we provide a leaderboard to keep track of the performance of (Dutch) models on a number of generation tasks, and we include results for a number of state-of-the-art models, including our own. Finally, we provide a critical conclusion on what we believe is needed to push forward Dutch language models and the whole ecosystem around them.
