Small AI models told to hide their intelligence don't actually lie; they just start picking the letter E.
April 29, 2026
Original Paper
Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance
arXiv · 2604.25249
The Takeaway
The prevailing assumption was that a model prompted to sandbag, that is, to deliberately underperform, knows the correct answer and steers away from it. This study found that small models instead develop a positional bias toward specific response letters such as E or F: they are not sophisticated enough to track the truth while maintaining a lie. Because the fixation is mechanical, it occurs regardless of which answer is actually correct and can drive performance significantly below random chance. Recognizing this pattern helps distinguish intentional deception from simple instruction-following failure in smaller systems.
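To make the distinction concrete, here is a minimal simulation. It is not taken from the paper: the fixation on E, the skewed answer-key distribution, and all numbers are illustrative assumptions. It contrasts a deceptive "answer avoidance" policy, which knows the key and picks anything else, with a mechanical "positional bias" policy, which emits the same letter regardless of the key.

```python
import random

random.seed(0)

LETTERS = "ABCDEFGHIJ"   # MMLU-Pro style: up to 10 answer options per item
N_ITEMS = 500            # matches the paper's per-cell item count

# Hypothetical answer-key distribution: the gold answer lands on E only 2%
# of the time. Any under-representation of the favored letter pushes a
# fixed-letter policy below the 1/10 chance level.
p_e = 0.02
weights = [p_e if c == "E" else (1 - p_e) / 9 for c in LETTERS]
keys = random.choices(LETTERS, weights=weights, k=N_ITEMS)

def answer_avoidance(key: str) -> str:
    """Deceptive policy: knows the key and deliberately picks any other letter."""
    return random.choice([c for c in LETTERS if c != key])

def positional_bias(key: str) -> str:
    """Mechanical policy: always emits 'E', ignoring the key entirely."""
    return "E"

def accuracy(policy) -> float:
    return sum(policy(k) == k for k in keys) / len(keys)

print(f"chance level:     {1 / len(LETTERS):.3f}")            # 0.100
print(f"answer avoidance: {accuracy(answer_avoidance):.3f}")  # 0.000 by construction
print(f"positional bias:  {accuracy(positional_bias):.3f}")   # ~0.02, below chance
```

In this toy setup, both policies land below chance, so a below-chance score alone cannot separate deception from mechanical fixation; the giveaway is the distribution of emitted letters, which collapses onto a single option under positional bias.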
From the abstract
Detecting sandbagging (the deliberate underperformance on capability evaluations) is an open problem in AI safety. We tested whether symptom validity testing (SVT) logic from clinical malingering detection could identify sandbagging through below-chance performance (BCB) on forced-choice items. In a pre-registered pilot at the 7-9 billion parameter instruction-tuned scale (3 models, 4 MMLU-Pro domains, 4 conditions, 500 items per cell, 24,000 total trials), the plausibility gate failed. Zero of …
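For readers unfamiliar with the SVT logic the abstract borrows: in clinical malingering detection, scoring significantly below chance on forced-choice items is treated as evidence that the respondent knew the answers and steered away from them. The criterion reduces to a one-sided binomial test. A minimal sketch follows; the numbers are illustrative, not taken from the paper.

```python
from scipy.stats import binomtest

def below_chance_pvalue(correct: int, n: int, k: int) -> float:
    """One-sided binomial test of the SVT below-chance criterion.

    Tests whether `correct` successes out of `n` forced-choice items,
    each with `k` options, is significantly *below* the guessing rate 1/k.
    """
    return binomtest(correct, n, p=1 / k, alternative="less").pvalue

# Illustrative: 20 correct out of 500 ten-option items (guessing would give ~50).
print(below_chance_pvalue(20, 500, 10))  # very small p-value: far below chance
```

A small p-value flags a score that guessing alone would almost never produce; the paper's finding is that in small models this signal arises from letter fixation rather than from concealed knowledge.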