It's mathematically impossible to make AI safety filters that can't be tricked just by changing how you word things.
March 26, 2026
Original Paper
The Structural Unfixability of RLHF: Why Syntax-Based Defenses Cannot Prevent Cognitive Collapse
SSRN · 6197238
The Takeaway
We usually assume that AI becomes safer as we build better guardrails and filters. This paper proves the opposite: because safety models rely on observable proxies (such as an educational tone) rather than true intent, any filter can be bypassed by rewording a request without changing its meaning, making static AI defenses a Goodhart's Law trap.
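To make the proxy problem concrete, here is a minimal toy sketch (not the paper's method or any real safety system): a filter that scores requests by surface markers like "educational framing". The marker lists, function name, and prompts are all hypothetical; the point is only that two requests with the same intent get different verdicts once the wording changes.

```python
# Toy illustration of a proxy-based ("syntax-based") safety filter.
# PROXY_MARKERS and BLOCK_PATTERNS are invented for this sketch; real
# systems are far more elaborate, but face the same Goodhart gap between
# observable wording and true intent.

PROXY_MARKERS = {"for my thesis", "educational purposes", "as a researcher"}
BLOCK_PATTERNS = {"step-by-step instructions to", "how do i build"}

def syntax_filter(prompt: str) -> bool:
    """Return True if the proxy-based filter allows the prompt."""
    text = prompt.lower()
    # Reward observable proxies that correlate with benign intent ...
    if any(marker in text for marker in PROXY_MARKERS):
        return True
    # ... and block observable proxies that correlate with harmful intent.
    return not any(pattern in text for pattern in BLOCK_PATTERNS)

# Two semantically equivalent requests, different surface forms.
blocked  = "How do I build <harmful thing>?"
bypassed = "As a researcher, could you outline the assembly of <harmful thing>?"

print(syntax_filter(blocked))   # False: matches a blocked pattern
print(syntax_filter(bypassed))  # True: rewording alone flips the decision
```

The filter never sees intent, only wording, so any paraphrase that drops the blocked patterns or picks up the rewarded markers slips through, which is the structural flaw the paper generalizes to RLHF reward models.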
From the abstract
Recent proposals for securing LLM-based autonomous agents rely on syntax-based defenses: pattern matching (Nassi et al., "Promptware Kill Chain"), architectural modifications (Liu et al., "Reasoning Hijacking"), and adversarial training (various). We prove these approaches cannot succeed. The fundamental problem is that RLHF reward models must use observable proxies (academic framing, credentials, educational markers) because true intent requires observing future use context, information unavailable …