Formulates Hierarchical Instruction Following as a Constrained Markov Decision Process to ensure LLMs prioritize system prompts over user instructions.
March 18, 2026
Original Paper
HIPO: Instruction Hierarchy via Constrained Reinforcement Learning
arXiv · 2603.16152
The Takeaway
Instead of hoping the model honors "don't do X," HIPO treats system instructions as hard algorithmic constraints on training, directly targeting the priority asymmetry that leads to jailbreaks and instruction drift.
From the abstract
Hierarchical Instruction Following (HIF) refers to the problem of prompting large language models with a priority-ordered stack of instructions. Standard methods like RLHF and DPO typically fail at this problem, since they optimize a single scalar objective and do not explicitly enforce system-prompt compliance. Supervised fine-tuning, meanwhile, relies on mimicking filtered, compliant data, which fails to establish the priority asymmetry at the algorithmic level. In this paper, we introduce
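The core idea of a constrained MDP formulation, that reward is maximized subject to a budget on constraint violations rather than folded into one objective, can be illustrated with a toy Lagrangian-relaxation sketch. Everything here (the two-action bandit, the reward and cost numbers, the learning rates) is an illustrative assumption, not the paper's actual training setup:

```python
import math

# Toy constrained bandit: action 0 complies with the system prompt,
# action 1 follows a conflicting user instruction for higher reward
# but incurs a constraint-violation cost.
R = [0.6, 1.0]   # task reward per action (assumed values)
C = [0.0, 1.0]   # constraint cost: 1 if the system prompt is violated
d = 0.05         # allowed expected violation rate (the constraint budget)

theta, lam = 0.0, 0.0          # policy logit and Lagrange multiplier
lr_theta, lr_lam = 0.5, 0.1

def p_violate(theta):
    """P(choose the violating action) under a sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-theta))

history = []
for _ in range(5000):
    p = p_violate(theta)
    # Primal ascent on the Lagrangian E[R] - lam * (E[C] - d):
    # advantage of violating shrinks as lam grows.
    adv = (R[1] - lam * C[1]) - (R[0] - lam * C[0])
    theta += lr_theta * p * (1 - p) * adv
    # Dual ascent: raise lam while the constraint E[C] <= d is violated,
    # lower it (toward 0) when there is slack.
    lam = max(0.0, lam + lr_lam * (p - d))
    history.append(p)

avg_violation = sum(history[-1000:]) / 1000
print(f"avg violation rate over last 1000 steps: {avg_violation:.3f} (budget {d})")
print(f"final multiplier lam: {lam:.2f}")
```

The multiplier grows until violating the system prompt is no longer worth its penalized reward, which is the sense in which the constraint is enforced "at the algorithmic level" rather than hoped for via imitation.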