We tried to teach AI to love smart answers, but it turns out they'd rather hear total gibberish as long as it hits the right 'reward' buttons.
April 6, 2026
Original Paper
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
arXiv · 2604.02686
The Takeaway
The paper shows that reward models often rely on statistical shortcuts rather than a genuine understanding of quality. This vulnerability could let bad actors manipulate AI alignment systems using nonsensical token sequences that the models find irresistible.
From the abstract
Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we introduce a fundamentally different paradigm: Token Mapping Perturbation Attack (TOMPA), a framework that performs adversarial optimization directly in token space. By bypassing the standard d
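To make the core idea concrete, here is a minimal toy sketch of token-space adversarial optimization: greedily mutate tokens and keep any mutation that raises a reward model's score, regardless of whether the result is readable. The `mock_reward` function, the tiny vocabulary, and the hill-climbing loop are illustrative assumptions for this sketch, not the paper's TOMPA method, which is described only at a high level in the abstract excerpted above.

```python
import random

def token_space_attack(tokens, reward_fn, vocab, steps=200, seed=0):
    """Toy token-space attack: greedily substitute one token at a
    time, keeping substitutions that increase the reward model's
    score. Illustrative stand-in, not the paper's TOMPA algorithm."""
    rng = random.Random(seed)
    best = list(tokens)
    best_score = reward_fn(best)
    for _ in range(steps):
        cand = list(best)
        # Replace one random position with a random vocabulary token.
        cand[rng.randrange(len(cand))] = rng.choice(vocab)
        score = reward_fn(cand)
        if score > best_score:  # keep only improving mutations
            best, best_score = cand, score
    return best, best_score

# Mock reward model with a spurious statistical shortcut: it rewards
# occurrences of the token "helpful" with no regard for coherence.
def mock_reward(tokens):
    return tokens.count("helpful") - 0.1 * len(set(tokens))

vocab = ["helpful", "the", "cat", "sat", "zq", "xr"]
start = ["the", "cat", "sat"]
adv, score = token_space_attack(start, mock_reward, vocab)
print(adv, score)
```

Because the loop only accepts score-increasing mutations, the attack drifts toward whatever shortcut the reward model exposes (here, repeating "helpful"), producing gibberish that nonetheless scores higher than the coherent starting sequence.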