Privacy-preserving Reinforcement Learning From Human Feedback Via Decoupled Reward Modeling
2026 · Young Hyun Cho, Will Wei Sun
Abstract
Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF
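The decoupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a linear reward under a Bradley-Terry preference model and swaps in DP-SGD-style noisy gradient descent (per-pair gradient clipping plus Gaussian noise) as the private learner. The names `private_reward_fit`, `greedy_policy`, `clip_norm`, and `noise_multiplier` are hypothetical, and calibrating the noise to a target privacy budget is left out.

```python
import math
import random


def clip(vec, max_norm):
    """Rescale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def private_reward_fit(pairs, dim, steps=200, lr=0.5, clip_norm=1.0,
                       noise_multiplier=1.0, seed=0):
    """Noisy gradient descent on the Bradley-Terry negative log-likelihood.

    pairs: list of (chosen_features, rejected_features), each of length dim.
    Assumption: noise_multiplier would be chosen by a privacy accountant
    for a target privacy budget; that accounting is not shown here.
    """
    rng = random.Random(seed)
    theta = [0.0] * dim
    n = len(pairs)
    for _ in range(steps):
        total = [0.0] * dim
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(t * d for t, d in zip(theta, diff))
            # d/dtheta of -log(sigmoid(margin)) = -sigmoid(-margin) * diff;
            # the margin is capped only to keep exp() from overflowing.
            coeff = -1.0 / (1.0 + math.exp(min(margin, 50.0)))
            grad = clip([coeff * d for d in diff], clip_norm)
            total = [a + g for a, g in zip(total, grad)]
        # Gaussian noise scaled to the clipping norm masks any single pair.
        noisy = [t + rng.gauss(0.0, noise_multiplier * clip_norm) for t in total]
        theta = [th - lr * g / n for th, g in zip(theta, noisy)]
    return theta


def greedy_policy(theta, candidates):
    """Return the index of the candidate with the highest learned reward.

    This step reads only theta, never the preference pairs, so it is
    post-processing of the private reward model.
    """
    def reward(i):
        return sum(t * c for t, c in zip(theta, candidates[i]))
    return max(range(len(candidates)), key=reward)


# Toy usage: annotators always prefer feature vector [1, 0] over [0, 1].
toy_pairs = [([1.0, 0.0], [0.0, 1.0])] * 20
theta_private = private_reward_fit(toy_pairs, dim=2)
best = greedy_policy(theta_private, [[0.0, 1.0], [1.0, 0.0]])
print(theta_private, best)
```

Because the policy step touches the preference data only through the learned parameters, the post-processing property of differential privacy means it inherits whatever guarantee the reward-learning step provides, provided the candidate responses themselves are not sensitive.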
Related papers
- Efficient Differentially Private Fine-tuning Of LLMs Via Reinforcement Learning (2025)
- Offline Reinforcement Learning With Differential Privacy (2022)
- Privacy-preserving Reinforcement Learning Beyond Expectation (2022)
- Preserving Expert-level Privacy In Offline Reinforcement Learning (2024)
- Local Differential Privacy For Regret Minimization In Reinforcement Learning (2020)
- Near-optimal Differentially Private Reinforcement Learning (2022)
- Privacy Preserving Reinforcement Learning For Population Processes (2024)
- Locally Private Distributed Reinforcement Learning (2020)