CL, AI, LG | Oct 11, 2023

On the Impact of Cross-Domain Data on German Language Models

Amin Dada, Aokun Chen, Cheng Peng, Kaleb E Smith, Ahmad Idrissi-Yaghir, Constantin Marc Seibold, Jianning Li, Lars Heiliger, Xi Yang, Christoph M. Friedrich, Daniel Truhn, Jan Egger

arXiv:2310.07321v221.2131 citations | h-index: 43 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the data selection challenge for German language models, showing incremental benefits of cross-domain data over quality-focused datasets.

The study asks whether cross-domain data diversity benefits the training of German language models more than a focus on high-quality data. Models trained on the cross-domain dataset outperformed those trained on quality data alone, improving on the previous state of the art by up to 4.45%.

Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging from 122M to 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to $4.45\%$ over the previous state-of-the-art. The models are available at https://huggingface.co/ikim-uk-essen
