GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training
The GPT-NL Public Corpus gives researchers and developers a permissively licensed, Dutch-first dataset for building lawful and useful language models, addressing a gap in Dutch-language NLP resources.
The authors address the lack of permissively licensed Dutch data for LLM pre-training by creating the GPT-NL Public Corpus, which contains 36B new Dutch tokens and over 500B tokens in total across several languages and code, all curated for compliance and publicly available.
We present the GPT-NL Public Corpus, the largest permissively licensed corpus of Dutch language resources. It contains 21 Dutch-only collections totalling 36B preprocessed Dutch tokens that are not present in any other LLM pre-training corpus. In addition, the corpus includes roughly 207B English, 232B code, and 48B German/Danish tokens taken from existing datasets, which we further curated for compliance. The corpus combines curated data from large existing corpora such as Common Corpus and Common Crawl with newly created Dutch-specific collections. Most of the newly created Dutch collections consist of content collected in collaboration with organisations or of synthetically augmented content. All data is collected and evaluated with the aim of facilitating the creation of (commercial) language models that are lawful, useful and non-harmful. All data included in the GPT-NL Public Corpus is sourced from datasets with permissive licensing and is curated and redistributed under a CC-BY license. The full dataset is publicly available on the Hugging Face Hub.
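As a quick check on the figures above, the following minimal sketch adds up the per-language token counts and computes each share of the total. The counts are the approximate values quoted in the abstract (in billions); the names `TOKENS_B` and `composition` are hypothetical and are not part of the corpus or any released tooling.

```python
# Approximate token counts in billions, as quoted in the abstract.
# These are rounded figures, so the derived shares are approximate too.
TOKENS_B = {
    "Dutch (new)": 36,
    "English": 207,
    "Code": 232,
    "German/Danish": 48,
}


def composition(tokens_b):
    """Return (total, shares), where shares maps each name to a percentage."""
    total = sum(tokens_b.values())
    shares = {name: 100 * count / total for name, count in tokens_b.items()}
    return total, shares


total, shares = composition(TOKENS_B)
print(f"total: ~{total}B tokens")
for name, pct in shares.items():
    print(f"{name:>14}: {pct:5.1f}%")
```

The total comes to roughly 523B tokens, consistent with the "over 500B" figure, and the new Dutch collections make up about 7% of it.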