AI & ML · Breaks Assumption

Uses token-level perplexity analysis to show that LLMs often rely on simple heuristics rather than the linguistic reasoning they appear to exhibit on standard benchmarks.

April 1, 2026

Original Paper

Is my model perplexed for the right reason? Contrasting LLMs' Benchmark Behavior with Token-Level Perplexity

Zoë Prins, Samuele Punzo, Frank Wildenburg, Giovanni Cinà, Sandro Pezzelle

arXiv · 2603.29396

The Takeaway

This work exposes a significant gap between "correct performance" and "correct reasoning". It gives researchers a principled way to test whether a model actually understands linguistic structures or is merely exploiting statistical shortcuts at the token level.
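The core idea can be sketched in a few lines: compare the perplexity a model assigns to the one "pivotal" token that differs between the two sentences of a minimal pair. A minimal sketch, assuming per-token log-probabilities are already available from some language model; the function names, example sentences, and numeric values below are illustrative, not from the paper:

```python
import math

def token_perplexities(log_probs):
    # Per-token perplexity is exp of the negative log-probability
    # the model assigns to each token given its preceding context.
    return [math.exp(-lp) for lp in log_probs]

def pivotal_contrast(log_probs_ok, log_probs_bad, pivot):
    # Compare perplexity at the single 'pivotal' token that
    # distinguishes the acceptable sentence from the unacceptable one.
    ppl_ok = token_perplexities(log_probs_ok)
    ppl_bad = token_perplexities(log_probs_bad)
    return ppl_bad[pivot] - ppl_ok[pivot]

# Hypothetical log-probs (not real model output) for a pair like
# "The keys to the cabinet are ..." vs. "... is ...", pivot at index 4.
ok  = [-1.2, -0.8, -2.1, -0.5, -0.3, -1.0]
bad = [-1.2, -0.8, -2.1, -0.5, -3.9, -1.0]
gap = pivotal_contrast(ok, bad, pivot=4)
print(round(gap, 2))  # a positive gap: the model is more perplexed at the bad pivot
```

If the model tracks the relevant linguistic cue, the perplexity gap should be concentrated at the pivotal token; a gap spread over unrelated tokens would instead suggest a statistical shortcut.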

From the abstract

Standard evaluations of large language models (LLMs) focus on task performance, offering limited insight into whether correct behavior reflects appropriate underlying mechanisms and risking confirmation bias. We introduce a simple, principled interpretability framework based on token-level perplexity to test whether models rely on linguistically relevant cues. By comparing perplexity distributions over minimal sentence pairs differing in one or a few 'pivotal' tokens, our method enables precise,