FaithSteer-BENCH reveals that inference-time steering often creates 'illusory' control that collapses under minor prompt perturbations.
March 20, 2026
Original Paper
FaithSteer-BENCH: A Deployment-Aligned Stress-Testing Benchmark for Inference-Time Steering
arXiv · 2603.18329
The Takeaway
The benchmark shows that common activation-level interventions induce prompt-conditional alignment rather than stable latent shifts: the steered behavior holds on the prompts it was evaluated on but collapses under minor prompt perturbations. This is a critical warning for researchers who rely on steering as a lightweight alternative to fine-tuning for safety or behavioral control.
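To make "prompt-conditional" concrete, here is a minimal toy sketch in NumPy. It is not the paper's method or benchmark: the two-dimensional hidden states, the tanh-plus-sigmoid readout, and every name (`steer`, `behavior_score`, `steering_gain`) are assumptions chosen for illustration. It shows how the same additive steering vector can produce a large behavioral change at one hidden state and almost none at a nearby, perturbed one.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add a scaled unit-norm steering direction to a hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def behavior_score(hidden, readout):
    """Toy nonlinear readout (assumption): tanh features, then a sigmoid."""
    logit = np.tanh(hidden) @ readout
    return 1.0 / (1.0 + np.exp(-logit))

def steering_gain(hidden, direction, readout, alpha):
    """Change in the behavior score caused by steering at this hidden state."""
    steered = behavior_score(steer(hidden, direction, alpha), readout)
    return steered - behavior_score(hidden, readout)

# Hypothetical values: one steering direction, one readout, two "prompts".
direction = np.array([1.0, 0.0])
readout = np.array([4.0, 0.0])
clean_hidden = np.array([0.0, 0.5])       # hidden state for the original prompt
perturbed_hidden = np.array([-3.0, 0.5])  # hidden state after a prompt perturbation

gain_clean = steering_gain(clean_hidden, direction, readout, alpha=1.0)
gain_perturbed = steering_gain(perturbed_hidden, direction, readout, alpha=1.0)

print(f"gain on original prompt:  {gain_clean:.4f}")
print(f"gain on perturbed prompt: {gain_perturbed:.4f}")
```

In this toy setup the perturbed hidden state sits in a saturated region of the readout, so the same intervention barely moves the score. Evaluating only on the original prompt would overstate how much control the steering vector provides.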
From the abstract
Inference-time steering is widely regarded as a lightweight and parameter-free mechanism for controlling large language model (LLM) behavior, and prior work has often suggested that simple activation-level interventions can reliably induce targeted behavioral changes. However, such conclusions are typically drawn under relatively relaxed evaluation settings that overlook deployment constraints, capability trade-offs, and real-world robustness. We therefore introduce FaithSteer-BENCH, a