AI & ML Paradigm Shift

Presents DataEvolve, a framework that enables AI to autonomously evolve and iteratively optimize pretraining data curation strategies.

March 17, 2026

Original Paper

Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

Tiantian Mi, Dongming Shan, Zhen Huang, Yiwei Qin, Muhang Xie, Yuxuan Qiao, Yixiu Liu, Chenyang Zhou, Pengfei Liu

arXiv · 2603.14420

The Takeaway

DataEvolve automates the manual, 'black art' process of data engineering. By showing that AI-evolved curation strategies can outperform massive human-curated datasets like FineWeb-Edu, it shifts the focus of foundation model development toward autonomous data feedback loops.

From the abstract

Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way?