Releases the GPT-NL Public Corpus, the largest permissively licensed (CC-BY) Dutch-first dataset for LLM pre-training.
April 2, 2026
Original Paper
GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training
arXiv · 2604.00920
The Takeaway
The corpus provides 36B high-quality Dutch tokens and over 500B total tokens of curated code and English, German, and Danish data. It is a critical resource for European researchers and companies building commercial-grade LLMs without the legal risks of scraped datasets such as Common Crawl.
From the abstract
We present the GPT-NL Public Corpus, the largest permissively licensed corpus of Dutch language resources. The GPT-NL Public Corpus contains 21 Dutch-only collections totalling 36B preprocessed Dutch tokens not present in any other LLM pretraining corpus. Additionally, the corpus includes roughly 207B English, 232B code, and 48B German/Danish tokens taken from existing sets, which we further curated for compliance. This corpus includes curated data from large existing corpora like Common Corpus […]