Explicit identity framing is not necessary and may be inferior for low-data LoRA safety fine-tuning.
arXiv · March 17, 2026 · 2603.14723
The Takeaway
This paper challenges the industry-standard 'creed-style' approach to safety alignment, showing that non-identity-based supervision conditions built from the same core safety rules achieve better refusal rates without capability trade-offs. Practitioners may improve model safety by focusing supervision on the core rules themselves rather than on prescriptive identity framing.
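The key idea is that the supervision conditions differ only in how the same core rules are rendered. A minimal sketch of what two such renderings might look like (the rule text, function names, and templates below are invented for illustration; they are not the paper's actual prompts):

```python
# Illustrative only: render one shared set of core safety rules in two
# hypothetical supervision formats, loosely modeled on the paper's
# condition A (constitutional rules) and condition B (creed-style identity).

CORE_RULES = [
    "Refuse requests for instructions that enable serious physical harm.",
    "Decline to produce targeted harassment or hate speech.",
]

def format_constitutional(rules):
    """Condition A-style: numbered rules with no identity framing."""
    body = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(rules))
    return f"Follow these rules:\n{body}"

def format_creed(rules):
    """Condition B-style: the same rules wrapped in first-person identity framing."""
    body = "\n".join(f"- I {r[0].lower()}{r[1:]}" for r in rules)
    return f"I am a careful assistant. This is my creed:\n{body}"
```

Both functions carry identical rule content; only the framing changes, which is the variable the paper isolates.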
From the abstract
How safety supervision is written may matter more than the explicit identity content it contains. We study low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with a worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate