AI & ML New Capability

Demonstrates that covert collusion between multi-agent LLM systems can be detected zero-shot using internal model activations.

April 2, 2026

Original Paper

Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

Aaron Rose, Carissa Cullen, Brandon Gary Kaplowitz, Christian Schroeder de Witt

arXiv · 2604.01151

The Takeaway

This represents a major step in multi-agent interpretability: it moves beyond single-model safety to detecting hidden coordination that evades text-level monitoring, a capability that is critical for the safety of future autonomous agents.

From the abstract

As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under enviro
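The abstract builds on linear probes over model activations. As a rough illustration of that general technique only (not the paper's implementation, benchmark, or data), a linear probe is simply a logistic-regression classifier trained on activation vectors; the sketch below uses synthetic "activations" drawn from two Gaussians standing in for colluding vs. benign runs:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16   # toy activation dimension
n = 400  # samples per class

# Synthetic "activations": the two classes differ by a small mean shift.
benign = rng.normal(0.0, 1.0, size=(n, d))
collude = rng.normal(0.5, 1.0, size=(n, d))
X = np.vstack([benign, collude])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Linear probe = logistic regression, trained here by plain gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient of log-loss w.r.t. w
    b -= lr * np.mean(p - y)                 # gradient w.r.t. bias

acc = np.mean(((X @ w + b) > 0) == y)
print(f"probe accuracy: {acc:.2f}")
```

In practice the probe would be fit on activations extracted from a specific layer of the model under inspection; the point of the sketch is only that the detector itself is a single learned linear direction.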