Develops a differentially private RLHF pipeline that decouples private reward learning from policy optimization, achieving strong alignment on Gemma-2B-IT with privacy guarantees.
March 25, 2026
Original Paper
Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling
arXiv · 2603.22563
The Takeaway
RLHF often uses sensitive human preference data; this paper provides a practical way to limit what the resulting model can leak about that data. The decoupled approach is significant because it confines differential privacy to the data-dependent reward model: the computationally expensive RL phase needs no privacy mechanism of its own, yet still inherits the guarantee by the post-processing property of differential privacy.
From the abstract
Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model.
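The two-phase structure described above can be sketched in miniature: train a reward model on preference pairs with DP-SGD (per-example gradient clipping plus Gaussian noise), then run any non-private policy step against the frozen private reward model, which inherits the same privacy guarantee by post-processing. This is a toy illustration under stated assumptions, not the paper's implementation: the linear Bradley-Terry reward model, the synthetic preference data, and all hyperparameters here are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical preference data: each pair (x_pos, x_neg) holds feature
# vectors for two responses, where x_pos was preferred by the annotator.
d, n = 8, 200
x_pos = rng.normal(1.0, 1.0, size=(n, d))
x_neg = rng.normal(0.0, 1.0, size=(n, d))

def per_example_grads(w, xp, xn):
    # Bradley-Terry loss -log sigmoid(r(xp) - r(xn)) with linear reward
    # r(x) = w . x; returns one gradient row per preference pair.
    diff = xp - xn
    p = 1.0 / (1.0 + np.exp(-(diff @ w)))
    return -(1.0 - p)[:, None] * diff

# Phase 1: DP-SGD on the reward model ONLY (clip + Gaussian noise).
clip, sigma, lr, steps, batch = 1.0, 1.0, 0.5, 200, 32
w = np.zeros(d)
for _ in range(steps):
    idx = rng.choice(n, batch, replace=False)
    g = per_example_grads(w, x_pos[idx], x_neg[idx])
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g / np.maximum(1.0, norms / clip)              # per-example clipping
    noisy_sum = g.sum(0) + rng.normal(0, sigma * clip, d)  # Gaussian noise
    w -= lr * noisy_sum / batch

# Phase 2: non-private policy step. The reward model w is now frozen, so
# any optimization against it (best-of-n here as a stand-in for RL) is
# post-processing of a private output and needs no further noise.
candidates = rng.normal(0.0, 1.0, size=(16, d))
best = candidates[np.argmax(candidates @ w)]
```

The key design point the sketch mirrors is that the noise budget is spent entirely in phase 1; phase 2 never touches the raw preference data, only the already-privatized reward parameters.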