AI & ML Paradigm Challenge

Your multilingual model's SOTA scores are likely an illusion caused by benchmarks that test reasoning and factual recall rather than actual language proficiency.

April 17, 2026

Original Paper

Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

arXiv · 2604.12911

The Takeaway

Frontier multilingual benchmarks fail to measure genuine linguistic nuance, largely testing mathematical reasoning and factual recall instead. This study shows that round-trip translation (translating text out to a target language and back) correlates with human ratings at 0.94, a level of agreement that current benchmarks come nowhere near. The implication is that we have been optimizing for the wrong metrics, producing models that score well yet sound unnatural to native speakers. If you want a model that actually speaks a language well, stop looking at MMLU scores and start looking at translation consistency. This is a wake-up call for the multilingual NLP community.
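
To make the mechanics concrete, here is a minimal sketch of a round-trip consistency check. The translate callable is a placeholder for whatever MT system you have on hand, and the character-level similarity from difflib is a stand-in scorer chosen to keep the example dependency-free; the paper's actual translation setup and scoring metric may well differ.

```python
from difflib import SequenceMatcher
from typing import Callable

def round_trip_score(
    text: str,
    target_lang: str,
    translate: Callable[[str, str, str], str],
) -> float:
    """Translate `text` into `target_lang` and back, then score how much
    of the original surfaced again.

    `translate(text, src, tgt)` is a hypothetical hook for any MT
    system (an API call, a local model); it is not from the paper.
    """
    forward = translate(text, "en", target_lang)
    back = translate(forward, target_lang, "en")
    # Stand-in metric: character-level similarity in [0, 1]. The paper
    # validates against human ratings; an embedding-based or learned
    # scorer would track meaning more faithfully in practice.
    return SequenceMatcher(None, text, back).ratio()

if __name__ == "__main__":
    # Toy check with an identity "translator": a lossless round trip
    # scores exactly 1.0; real MT systems will land below that.
    identity = lambda text, src, tgt: text
    print(round_trip_score("The cat sat on the mat.", "de", identity))
```

Part of the appeal of this design is that it needs no gold translations: the source text serves as its own reference, so the check extends to any language pair the translation system covers.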

From the abstract

Multilingual benchmarks guide the development of frontier models. Yet the multilingual evaluations reported for frontier models are structured similarly to popular reasoning and knowledge benchmarks, merely replicated across many languages. We show that such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world …