The TaoBench benchmark shows that state-of-the-art math LLMs fail on logically equivalent problems when they are presented outside the standard MathLib framework.
arXiv · March 16, 2026 · 2603.12744
Why it matters
The 26% performance drop when definitions are swapped from their MathLib-standard form to bespoke equivalents suggests that current automated theorem provers rely heavily on library-specific memorization. This points to a major bottleneck in generalizing AI-driven mathematical reasoning.
From the abstract
Automated theorem proving (ATP) benchmarks largely consist of problems formalized in MathLib, so current ATP training and evaluation are heavily biased toward MathLib's definitional framework. However, frontier mathematics is often exploratory and prototype-heavy, relying on bespoke constructions that deviate from standard libraries. In this work, we evaluate the robustness of current ATP systems when applied to a novel definitional framework, specifically examining the performance gap between standard MathLib formalizations and bespoke, definitionally equivalent restatements.
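To make the definitional shift concrete, here is a minimal Lean 4 sketch of the kind of gap the paper describes. The names `StdEven` and `BespokeEven` are illustrative assumptions, not definitions taken from TaoBench; the point is that two provably equivalent definitions present very different surface forms to a prover trained on one library's conventions.

```lean
-- A "library-standard" definition, mirroring the shape of Mathlib's `Even`:
def StdEven (n : Nat) : Prop := ∃ r, n = r + r

-- A logically equivalent "bespoke" definition, the kind a mathematician
-- might write from scratch in an exploratory development:
def BespokeEven (n : Nat) : Prop := ∃ k, n = 2 * k

-- The two definitions are provably equivalent.
theorem stdEven_iff_bespokeEven (n : Nat) : StdEven n ↔ BespokeEven n := by
  unfold StdEven BespokeEven
  constructor
  · intro ⟨r, h⟩; exact ⟨r, by omega⟩
  · intro ⟨k, h⟩; exact ⟨k, by omega⟩

-- Same mathematical content, different surface form. A prover that has
-- memorized library lemmas about the standard form may solve the first
-- goal yet fail the second, even though both hold by `rfl`:
example : StdEven 4 := ⟨2, rfl⟩
example : BespokeEven 4 := ⟨2, rfl⟩
```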