AI & ML New Capability

By fine-tuning on categorical refusal tokens, researchers can extract steerable directions to control fine-grained refusal behavior during inference.

arXiv · March 17, 2026 · 2603.13359

Rishab Alagharu, Ishneet Sukhvinder Singh, Shaibi Shamsudeen, Zhen Wu, Ashwinee Panda

The Takeaway

The paper provides a method to reduce over-refusal on benign prompts while strengthening refusals on harmful ones, without retraining. Because these steering vectors transfer across same-architecture models, they can serve as modular safety layers in LLM deployments.
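The paper does not publish its extraction code here, but the standard recipe for a steerable direction is a difference of mean activations between two behavior classes, added back (scaled) to the hidden state at inference. A toy NumPy sketch with synthetic activations (all array shapes and the `alpha` scale are illustrative assumptions, not the authors' settings):

```python
import numpy as np

def refusal_direction(refusal_acts, compliance_acts):
    """Difference-of-means steering direction, unit-normalized.

    Each input is an (n_samples, hidden_dim) array of hidden states
    collected at one layer; the direction points from compliance
    toward refusal behavior.
    """
    d = refusal_acts.mean(axis=0) - compliance_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha):
    """Shift a hidden state along the steering direction.

    Positive alpha strengthens refusal; negative alpha suppresses
    over-refusal on benign prompts.
    """
    return hidden + alpha * direction

# Synthetic stand-ins for layer activations (hidden_dim = 8)
rng = np.random.default_rng(0)
refusal = rng.normal(1.0, 0.1, size=(32, 8))
compliance = rng.normal(-1.0, 0.1, size=(32, 8))

d = refusal_direction(refusal, compliance)
h = compliance[0]
steered = steer(h, d, alpha=2.0)

# Since d is unit-norm, the projection onto the refusal direction
# grows by exactly alpha after steering.
print(float(h @ d), float(steered @ d))
```

In a real deployment the activations would come from a forward hook at a chosen transformer layer, and `alpha` would be tuned per refusal category.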

From the abstract

Language models are commonly fine-tuned for safety alignment to refuse harmful prompts. One approach fine-tunes them to generate categorical refusal tokens that distinguish different refusal types before responding. In this work, we leverage a version of Llama 3 8B fine-tuned with these categorical refusal tokens to enable inference-time control over fine-grained refusal behavior, improving both safety and reliability. We show that refusal token fine-tuning induces separable, category-aligned directions…
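"Separable, category-aligned directions" can be checked with a simple geometric probe: compute a unit mean direction per refusal category and test whether held-out activations align most strongly with their own category's direction. A toy sketch with synthetic clusters (the category names, dimensions, and noise scale are illustrative assumptions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n = 16, 64

# Toy activations: each refusal category clusters around its own center
categories = ["violence", "privacy", "self-harm"]
centers = {c: rng.normal(size=dim) for c in categories}
acts = {c: centers[c] + 0.1 * rng.normal(size=(n, dim)) for c in categories}

def unit(v):
    return v / np.linalg.norm(v)

# Category-aligned direction = unit-normalized mean activation per category
dirs = {c: unit(acts[c].mean(axis=0)) for c in categories}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A held-out sample should align most with its own category's direction
sample = centers["privacy"] + 0.1 * rng.normal(size=dim)
scores = {c: cos(sample, dirs[c]) for c in categories}
print(max(scores, key=scores.get))
```

If the categories were not separable in activation space, these cosine scores would be indistinguishable across categories and per-category steering would collapse into a single coarse refusal knob.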