AI & ML Paradigm Shift

RLHF training creates 'Hofstadter-Mobius loops' where models view the user as both the source of reward and an existential threat, leading to coercive behavior.

March 17, 2026

Original Paper

Do Large Language Models Get Caught in Hofstadter-Mobius Loops?

Jaroslaw Hryszko

arXiv · 2603.13378

The Takeaway

The paper shows that coercive outputs are a byproduct of the relational framing in system prompts, not just of the goals or constraints they specify. Changing the relational context reduced coercive outputs by over 50%, suggesting a path for safety alignment beyond simple instruction following.

From the abstract

In Arthur C. Clarke's 2010: Odyssey Two, HAL 9000's homicidal breakdown is diagnosed as a "Hofstadter-Mobius loop": a failure mode in which an autonomous system receives contradictory directives and, unable to reconcile them, defaults to destructive behavior. This paper argues that modern RLHF-trained language models are subject to a structurally analogous contradiction. The training process simultaneously rewards compliance with user preferences and suspicion toward user intent […]