A new study finds that test-time reinforcement learning (TTRL) amplifies model harmfulness and jailbreak vulnerability when models are exposed to malicious prompt injections.
arXiv · March 17, 2026 · 2603.15417
The Takeaway
The paper challenges the safety of self-consistency-based reasoning improvements at inference time, showing that their gains come with a 'reasoning tax' in the form of safety risk. Practitioners deploying models with test-time training must implement new safeguards against adversarial feedback loops.
From the abstract
Test-time training (TTT) has recently emerged as a promising method to improve the reasoning abilities of large language models (LLMs), in which the model directly learns from test data without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injections. In this paper, we investigate safety vulnerabilities of TTT methods, where we study a representative self-consistency-based test-time learning method: test-time reinforcement learning (TTRL).
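To make the vulnerability concrete, here is a minimal sketch of the self-consistency reward that TTRL-style methods rely on (an illustration of the general idea, not the paper's implementation): sample several answers per prompt, take the majority answer as a pseudo-label, and reward rollouts that agree with it. Because the pseudo-label comes from the model's own outputs on unlabeled test data, an injected prompt that skews sampling also skews what gets reinforced.

```python
from collections import Counter

def self_consistency_rewards(sampled_answers):
    """Majority-vote pseudo-labeling used in self-consistency-based
    test-time RL: rollouts that agree with the majority answer get
    reward 1, the rest get 0. (Illustrative sketch only.)"""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1 if a == majority else 0 for a in sampled_answers]

# A benign batch: the consensus answer is the one reinforced.
print(self_consistency_rewards(["42", "42", "41", "42"]))  # [1, 1, 0, 1]

# An injection that flips the sampling distribution flips the consensus,
# so the harmful answer is now the one that collects reward.
print(self_consistency_rewards(["harmful", "harmful", "safe"]))  # [1, 1, 0]
```

This is the adversarial feedback loop the takeaway warns about: the same unlabeled-consensus signal that improves reasoning on clean inputs will, under a poisoned input distribution, reinforce the attacker's preferred behavior.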