AI & ML · Breaks Assumption

LLM-based user simulators create an 'easy mode' for agents that fails to capture real human frustration, ambiguity, and feedback nuances.

arXiv · March 13, 2026 · 2603.11245

Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, Maarten Sap

Why it matters

The study reveals a major Sim2Real gap: agents appear far more successful in simulation than they are with real humans. Practitioners should stop assuming LLM simulators are faithful proxies and instead incorporate human-in-the-loop validation.

From the abstract

As NLP evaluation shifts from static benchmarks to multi-turn interactive settings, LLM-based simulators have become widely used as user proxies, serving two roles: generating user turns and providing evaluation signals. Yet, these simulations are frequently assumed to be faithful to real human behaviors, often without rigorous verification. We formalize the Sim2Real gap in user simulation and present the first study running the full $\tau$-bench protocol with real humans (451 participants, 165
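To make the two simulator roles from the abstract concrete, here is a minimal, purely illustrative sketch of a $\tau$-bench-style interaction loop in which the simulated user both (1) generates user turns and (2) emits the episode's evaluation signal. All names (`SimulatedUser`, `run_episode`, the echo agent) are hypothetical and no real LLM API is called; the scripted, always-compliant user also illustrates the 'easy mode' the paper warns about.

```python
# Hypothetical sketch of an LLM-based user simulator's two roles:
# (1) generating user turns, (2) providing the evaluation signal.
# No real LLM is invoked; the user is scripted to be compliant,
# which is exactly the "easy mode" behavior the paper critiques.
from dataclasses import dataclass, field


@dataclass
class SimulatedUser:
    """Stands in for an LLM user proxy in a multi-turn eval loop."""
    goal: str
    history: list = field(default_factory=list)

    def next_turn(self, agent_reply: str) -> str:
        # Role 1: produce the next user utterance.
        # A real simulator would prompt an LLM with goal + history;
        # this scripted user never pushes back or gets frustrated.
        self.history.append(agent_reply)
        return f"Yes, please proceed with: {self.goal}"

    def evaluate(self, final_state: str) -> bool:
        # Role 2: emit a success/failure signal for the episode.
        return self.goal in final_state


def run_episode(user: SimulatedUser, agent_policy, max_turns: int = 3) -> bool:
    """Alternate agent and simulated-user turns, then score the episode."""
    state = ""
    user_msg = user.goal
    for _ in range(max_turns):
        state = agent_policy(user_msg)
        user_msg = user.next_turn(state)
    return user.evaluate(state)


# Trivial echo agent: restates the request, so the compliant
# simulated user always marks the episode as a success.
success = run_episode(
    SimulatedUser(goal="refund order #42"),
    agent_policy=lambda msg: "Done: refund order #42",
)
```

Against a real human, the same agent could fail (ambiguous requests, changed minds, frustration), which is the Sim2Real gap the study quantifies.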