It's mathematically impossible to make AI safety filters that can't be tricked just by changing how you word things.
March 26, 2026
Original Paper
The Structural Unfixability of RLHF: Why Syntax-Based Defenses Cannot Prevent Cognitive Collapse
SSRN · 6197238
The Takeaway
We usually assume that AI becomes safer as we build better guardrails and filters. This paper proves the opposite: because safety models rely on observable proxies (such as an educational tone) rather than true intent, any filter can be bypassed by rewording a request without changing its meaning, making static AI defenses a Goodhart's Law trap.
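To make the proxy problem concrete, here is a minimal toy sketch (not the paper's method or any real safety system): a filter that scores requests by surface markers like "educational framing". The marker lists, function name, and prompts are all hypothetical; the point is only that two requests with the same intent get different verdicts once the wording changes.

```python
# Toy illustration of a proxy-based ("syntax-based") safety filter.
# PROXY_MARKERS and BLOCK_PATTERNS are invented for this sketch; real
# systems are far more elaborate, but face the same Goodhart gap between
# observable wording and true intent.

PROXY_MARKERS = {"for my thesis", "educational purposes", "as a researcher"}
BLOCK_PATTERNS = {"step-by-step instructions to", "how do i build"}

def syntax_filter(prompt: str) -> bool:
    """Return True if the proxy-based filter allows the prompt."""
    text = prompt.lower()
    # Reward observable proxies that correlate with benign intent ...
    if any(marker in text for marker in PROXY_MARKERS):
        return True
    # ... and block observable proxies that correlate with harmful intent.
    return not any(pattern in text for pattern in BLOCK_PATTERNS)

# Two semantically equivalent requests, different surface forms.
blocked  = "How do I build <harmful thing>?"
bypassed = "As a researcher, could you outline the assembly of <harmful thing>?"

print(syntax_filter(blocked))   # False: matches a blocked pattern
print(syntax_filter(bypassed))  # True: rewording alone flips the decision
```

The filter never sees intent, only wording, so any paraphrase that drops the blocked patterns or picks up the rewarded markers slips through, which is the structural flaw the paper generalizes to RLHF reward models.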
From the abstract
Recent proposals for securing LLM-based autonomous agents rely on syntax-based defenses: pattern matching (Nassi et al., "Promptware Kill Chain"), architectural modifications (Liu et al., "Reasoning Hijacking"), and adversarial training (various). We prove these approaches cannot succeed. The fundamental problem is that RLHF reward models must use observable proxies (academic framing, credentials, educational markers) because true intent requires observing future use context, information unavailable …