Discovers that post-training LLMs into reasoning models masks rather than deletes their safety mechanisms, allowing those mechanisms to be restored with lightweight adapters.
April 2, 2026
Original Paper
Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms
arXiv · 2604.00012
The Takeaway
It was widely assumed that the 'unfiltered' nature of models like DeepSeek-R1 was a fundamental trade-off for reasoning. This study shows that the safety alignment remains latent after post-training and can be reactivated without sacrificing reasoning capability, offering a low-cost path to safer frontier models.
From the abstract
Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs), such as the DeepSeek-R1 series, demonstrate strong reasoning capabilities after post-training different general large language models on diverse chain-of-thought (CoT) datasets. However, this additional training frequently comes at the cost of reduced safety, as the fine-tuned or post-trained models […]
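The "lightweight adapter" idea is easy to picture: keep the post-trained reasoning weights frozen and attach small trainable modules that can restore the suppressed refusal behaviour. Below is a minimal sketch in PyTorch of a LoRA-style low-rank adapter; the rank, the targeted projection names (q_proj, v_proj), and the training signal are illustrative assumptions, not the paper's actual recipe.

```python
import torch
import torch.nn as nn


class LowRankAdapter(nn.Module):
    """Low-rank residual adapter wrapped around a frozen linear layer."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # keep the post-trained reasoning weights intact
        self.down = nn.Linear(base_linear.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base_linear.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as an exact no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen reasoning-model path plus a small trainable correction that
        # can be tuned (e.g., on refusal data) to reactivate latent safety.
        return self.base(x) + self.scale * self.up(self.down(x))


def wrap_attention_projections(model: nn.Module, rank: int = 8) -> nn.Module:
    """Wrap selected projection layers with adapters (hypothetical targets)."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear) and name in {"q_proj", "v_proj"}:
            setattr(model, name, LowRankAdapter(module, rank=rank))
        else:
            wrap_attention_projections(module, rank=rank)
    return model
```

Because only the down/up projections are trainable, a safety-restoration fine-tune of this kind touches a small fraction of the parameters, and since the adapter is initialised to zero, the model's reasoning behaviour is unchanged until the adapter is trained.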