Presents DataEvolve, a framework that enables AI to autonomously evolve and iteratively optimize pretraining data curation strategies.
March 17, 2026
Original Paper
Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation
arXiv · 2603.14420
The Takeaway
DataEvolve automates the manual, 'black art' process of data engineering. By showing that AI-evolved curation strategies can outperform large human-curated datasets such as FineWeb-Edu, it shifts the focus of foundation model development toward autonomous data feedback loops.
From the abstract
Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way?