AI & ML New Capability

A benchmark for unsolved math problems with automated verification, enabling the measurement of true mathematical discovery.

arXiv · March 17, 2026 · 2603.15617

Erik Y. Wang, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath, Charles London, Kalyan Ramakrishnan, Flaviu Cipcigan, Philip Torr, Alessandro Abate

The Takeaway

Standard benchmarks (GSM8K, MATH) are contaminated and largely solved. HorizonMath provides a scalable way to evaluate whether a model can actually generate *new* mathematical knowledge that is computationally verifiable, potentially paving the way for genuine scientific breakthroughs.
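To give a concrete sense of what "computationally verifiable" can mean, here is a minimal sketch of automated verification for a construction-style problem (finding a large Sidon set). The problem choice, function names, and scoring rule are illustrative assumptions, not part of the actual HorizonMath framework; the point is only that a proposed solution can be accepted or rejected by a deterministic checker rather than by human grading.

```python
# Illustrative sketch (not the HorizonMath API): a deterministic checker for a
# candidate mathematical construction. A model proposes a Sidon set -- a set of
# integers whose pairwise sums are all distinct -- and verification reduces to
# a pass/fail computation.

from itertools import combinations_with_replacement


def is_sidon_set(candidate: list[int]) -> bool:
    """Return True iff all pairwise sums a + b (with a <= b) are distinct."""
    elems = sorted(set(candidate))
    if len(elems) != len(candidate):
        return False  # duplicate elements are not allowed
    seen_sums: set[int] = set()
    for a, b in combinations_with_replacement(elems, 2):
        s = a + b
        if s in seen_sums:
            return False
        seen_sums.add(s)
    return True


def verify_submission(candidate: list[int], upper_bound: int, target_size: int) -> bool:
    """Accept only a valid Sidon set within [1, upper_bound] of at least the claimed size."""
    return (
        len(candidate) >= target_size
        and all(1 <= x <= upper_bound for x in candidate)
        and is_sidon_set(candidate)
    )


if __name__ == "__main__":
    # A small known-valid example: {1, 2, 5, 11} is a Sidon set in [1, 11].
    proposal = [1, 2, 5, 11]
    print(verify_submission(proposal, upper_bound=11, target_size=4))  # True
```

Under this kind of setup, a "new result" is simply a submission the checker accepts that beats the best previously verified construction, which is what makes the evaluation scalable and contamination-resistant.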

From the abstract

Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of …