AI & ML Breaks Assumption

A rigorous analysis of the AIMO 3 math competition reveals that raw model capability dominates inference-time prompt optimization by an order of magnitude.

March 31, 2026

Original Paper

Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Natapong Nitarach

arXiv · 2603.27844

The Takeaway

The paper challenges the hype around complex reasoning-strategy ensembling, showing that diverse prompting interventions often fail to outperform simple high-temperature sampling. This suggests practitioners should invest in base model capability or RL training rather than elaborate prompt mixers.

From the abstract

Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix: assign structurally different reasoning strategies to different voters to decorrelate errors. We test this Diverse Prompt Mixer in the AIMO~3 competition: 3 models, 23+ experiments, and 50 IMO-level problems on a single H100 80 GB with a 5-hour limit. Every intervention fails. High-temperature sampling already decorrelates errors sufficiently; weaker p
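The voting scheme the abstract describes can be sketched in a few lines: sample several independent attempts at the same problem (e.g. at high temperature), extract each attempt's final answer, and return the most common one. This is a minimal illustration of majority voting in general, not the paper's actual harness; the function and variable names are placeholders.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer across independent samples.

    Correlated errors shrink the effective sample size: if several
    voters make the same mistake, the wrong answer can win the vote.
    """
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical final answers from 5 high-temperature samples
samples = ["42", "42", "17", "42", "108"]
print(majority_vote(samples))  # -> 42
```

The decorrelation idea tested in the paper amounts to making the voters disagree for different reasons (different reasoning strategies per voter); the finding is that temperature alone already achieves most of that disagreement.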