Releases the GPT-NL Public Corpus, the largest permissively licensed (CC-BY) Dutch-first dataset for LLM pre-training.
April 2, 2026
Original Paper
GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training
arXiv · 2604.00920
The Takeaway
The corpus provides 36B high-quality Dutch tokens and over 500B total tokens of curated code and English, German, and Danish data. It is a critical resource for European researchers and companies building commercial-grade LLMs without the legal risks of scraped datasets such as Common Crawl.
From the abstract
We present the GPT-NL Public Corpus, the largest permissively licensed corpus of Dutch language resources. The GPT-NL Public Corpus contains 21 Dutch-only collections totalling 36B preprocessed Dutch tokens not present in any other LLM pretraining corpus. Additionally, the corpus includes roughly 207B English, 232B code, and 48B German/Danish tokens taken from existing sets, which we further curated for compliance. This corpus includes curated data from large existing corpora like Common Corpus […]