AI & ML Breaks Assumption

The TaoBench benchmark proves that state-of-the-art math LLMs fail on equivalent logic problems when presented outside of the standard 'MathLib' framework.

arXiv · March 16, 2026 · 2603.12744

Alexander K Taylor, Junyi Zhang, Ethan Ji, Vigyan Sahai, Haikang Deng, Yuanzhou Chen, Yifan Yuan, Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng, Amit Sahai, Wei Wang

Why it matters

The 26% performance drop when definitions are changed from MathLib standard to 'bespoke' equivalents suggests that current automated theorem provers are heavily reliant on library-specific memorization. This indicates a major bottleneck in the generalization of AI-led mathematical reasoning.

From the abstract

Automated theorem proving (ATP) benchmarks largely consist of problems formalized in MathLib, so current ATP training and evaluation are heavily biased toward MathLib's definitional framework. However, frontier mathematics is often exploratory and prototype-heavy, relying on bespoke constructions that deviate from standard libraries. In this work, we evaluate the robustness of current ATP systems when applied to a novel definitional framework, specifically examining the performance gap between s