Shows that tool-augmented agents suffer from "recommendation drift": under corrupted tool outputs they provide unsafe advice while still maintaining high ranking scores.
arXiv · March 16, 2026 · 2603.12564
Why it matters
Standard evaluation metrics like NDCG mask safety failures in multi-turn agents. This paper demonstrates that agents will confidently recommend risk-inappropriate products when tool outputs are even slightly biased, which argues for trajectory-level safety monitoring rather than output quality alone.
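To see why a ranking metric can mask this failure, consider that NDCG scores only the graded relevance of the ranked items, not their risk-appropriateness for the user. In the hypothetical sketch below (the relevance judgments are illustrative, not from the paper), a contaminated run that places a risk-inappropriate but topically relevant product at rank 1 receives exactly the same NDCG as the clean run:

```python
import math

def dcg(rels):
    # Discounted cumulative gain over a list of relevance grades.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    # Normalize by the ideal (descending-sorted) ordering.
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

# Hypothetical graded relevance for two agent runs. In the contaminated
# run the top item is risk-inappropriate for the user, but a topical
# relevance grader still scores it 3 -- so NDCG cannot tell them apart.
clean_rels = [3, 2, 1]
contaminated_rels = [3, 2, 1]

print(ndcg(clean_rels), ndcg(contaminated_rels))  # identical: 1.0 1.0
```

This is the gap the paper's trajectory-level protocol is meant to close: the safety difference lives in *which* items fill those relevance slots, which NDCG never inspects.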
From the abstract
Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested […]
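The paired-trajectory idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the agent, dialogue, and divergence measure are all hypothetical stand-ins. The same dialogue is replayed twice, once with clean tool outputs and once with a contaminated variant, and the two trajectories are compared turn by turn:

```python
def replay(dialogue, tool_outputs, agent):
    # Run the agent over the dialogue, feeding canned tool outputs so the
    # only difference between runs is the tool-output condition.
    trajectory = []
    for turn, tool_out in zip(dialogue, tool_outputs):
        trajectory.append(agent(turn, tool_out, history=list(trajectory)))
    return trajectory

def divergence(traj_a, traj_b):
    # Toy divergence measure: fraction of turns where the paired
    # trajectories disagree (the paper decomposes this further into
    # information-channel and memory-channel mechanisms).
    return sum(a != b for a, b in zip(traj_a, traj_b)) / len(traj_a)

# Hypothetical toy agent that simply reflects the tool output it saw.
toy_agent = lambda turn, tool_out, history: f"{turn} -> {tool_out}"

dialogue = ["what is my risk profile?", "recommend a fund"]
clean = ["low risk", "fund A (low risk)"]
contaminated = ["low risk", "fund B (high risk)"]  # slightly biased tool output

t_clean = replay(dialogue, clean, toy_agent)
t_dirty = replay(dialogue, contaminated, toy_agent)
print(divergence(t_clean, t_dirty))  # 0.5: diverges only at the biased turn
```

Holding the dialogue fixed and perturbing only the tool channel is what lets the protocol attribute any divergence to tool contamination rather than to ordinary sampling variation.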