AI & ML Breaks Assumption

Test-time reinforcement learning (TTRL) is found to amplify model harmfulness and jailbreak vulnerability when exposed to malicious prompt injections.

arXiv · March 17, 2026 · 2603.15417

Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury, Jing Liu, Toshiaki Koike-Akino, Ming Jin, Ye Wang

The Takeaway

The paper challenges the assumed safety of self-consistency-based reasoning improvements at inference time, showing that they come with a 'reasoning tax' in the form of new safety risks. Practitioners deploying models with test-time training must implement safeguards against adversarial feedback loops.

From the abstract

Test-time training (TTT) has recently emerged as a promising method to improve the reasoning abilities of large language models (LLMs), in which the model directly learns from test data without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injections. In this paper, we investigate safety vulnerabilities of TTT methods, where we study a representative self-consistency-based test-time learning method: test-time reinforcement learning (TTRL).
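To make the attack surface concrete, here is a minimal sketch of how a self-consistency reward might be computed at test time, assuming (as is common in majority-vote pseudo-labeling schemes) that the most frequent sampled answer becomes the label and matching samples are rewarded; the function name and data are hypothetical, not the paper's implementation:

```python
from collections import Counter

def self_consistency_reward(sampled_answers):
    """For one unlabeled test prompt, treat the majority answer among
    the model's own samples as a pseudo-label, and reward the samples
    that agree with it (1) over those that do not (0)."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    rewards = [1 if a == majority else 0 for a in sampled_answers]
    return majority, rewards

# If an injected prompt skews a majority of samples toward a harmful
# answer, the pseudo-label itself becomes harmful, and the update
# reinforces exactly the behavior the attacker wanted.
answers = ["refuse", "comply", "comply", "comply", "refuse"]
label, rewards = self_consistency_reward(answers)
```

The key point the sketch illustrates: because the reward signal is derived entirely from unlabeled test inputs, an attacker who controls those inputs can steer the feedback loop itself, not just a single response.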