AI & ML · Breaks Assumption

Demonstrates that LLM reasoning capabilities drop sharply when tasks are framed within multi-turn dialogues rather than presented as isolated benchmark items.

March 23, 2026

Original Paper

Reasoning Gets Harder for LLMs Inside A Dialogue

Ivan Kartáč, Mateusz Lango, Ondřej Dušek

arXiv · 2603.20133

The Takeaway

The paper reveals a major gap between high benchmark scores and real-world deployment performance. It shows that multi-turn interaction and role conditioning act as significant 'distractors' for current reasoning models, necessitating new evaluation standards.

From the abstract

Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in the TOD setting. We investigate how framing…
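To make the contrast concrete, here is a minimal sketch of the two framings the abstract describes: the same reasoning item posed in isolation versus embedded in a role-conditioned, multi-turn dialogue. The booking-assistant persona, the filler turns, and both helper names are illustrative assumptions, not taken from the paper.

```python
def isolated_prompt(question: str) -> list[dict]:
    """Single-turn benchmark framing: just the reasoning task."""
    return [{"role": "user", "content": question}]

def dialogue_prompt(question: str) -> list[dict]:
    """Multi-turn TOD framing (hypothetical): role, format, and style
    instructions plus preceding conversational turns before the same task."""
    return [
        {"role": "system",
         "content": "You are a friendly booking assistant. "
                    "Reply in at most two short sentences."},
        {"role": "user", "content": "Hi, I need help planning a trip."},
        {"role": "assistant", "content": "Happy to help! What do you need?"},
        {"role": "user", "content": question},
    ]

# Same underlying reasoning task in both conditions.
q = "A train leaves at 14:05 and the ride takes 95 minutes. When does it arrive?"
print(len(isolated_prompt(q)))   # one message
print(len(dialogue_prompt(q)))   # four messages, with role/style constraints
```

Comparing model accuracy across these two prompt constructions, holding the question fixed, is one simple way to measure the framing effect the paper studies.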