Policy Filtration For RLHF To Mitigate Noise In Reward Models
2024 · Chuheng Zhang, Wei Shen, Li Zhao, et al.
Abstract
While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better responses under the supervision of a reward model learned from preference data. One major challenge of RLHF is the inaccuracy of the intermediate reward model, especially in tasks that require complex reasoning for the reward model to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter out samples whose rewards may be unreliable in order to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtering strategy, we use the coefficient of determination (R²) between the rewards and the actual scores on filtered samples as a metric to help us find promising strategies, since it measures how well the rewards filtered by PF-PPO indicate real per
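The strategy-selection idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes R² is computed as the squared Pearson correlation between reward and actual score (equivalent to the R² of a simple least-squares line fit), and the two filters `keep_all` and `keep_extremes`, the helper `score_strategy`, and the toy data are hypothetical names and values introduced only for illustration.

```python
from statistics import mean


def r_squared(rewards, scores):
    """R^2 of a least-squares line fit of scores on rewards.

    Assumption: for a single predictor this equals the squared
    Pearson correlation, which is what is computed here.
    """
    mx, my = mean(rewards), mean(scores)
    sxx = sum((x - mx) ** 2 for x in rewards)
    syy = sum((y - my) ** 2 for y in scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(rewards, scores))
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate case: no variance to explain
    return (sxy * sxy) / (sxx * syy)


def keep_all(samples):
    """Hypothetical baseline filter: keep every (reward, score) pair."""
    return list(samples)


def keep_extremes(samples, k=1):
    """Hypothetical filter: keep the k lowest- and k highest-reward samples."""
    ranked = sorted(samples, key=lambda s: s[0])
    if len(ranked) <= 2 * k:
        return ranked
    return ranked[:k] + ranked[-k:]


def score_strategy(strategy, samples):
    """R^2 between rewards and actual scores on the samples a filter keeps."""
    kept = strategy(samples)
    rewards = [r for r, _ in kept]
    scores = [s for _, s in kept]
    return r_squared(rewards, scores)


# Toy (reward, actual score) pairs; mid-range rewards are deliberately noisy.
samples = [(0.1, 0.0), (0.4, 1.0), (0.5, 0.0), (0.6, 1.0), (0.9, 1.0)]

for strategy in (keep_all, keep_extremes):
    print(strategy.__name__, round(score_strategy(strategy, samples), 3))
```

On this toy data the filter that drops the mid-range rewards yields a higher R² than keeping everything, which is the kind of comparison the abstract describes using to shortlist filtering strategies before policy training.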
Related papers
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Can RLHF Be More Efficient With Imperfect Reward Models? A Policy Coverage Perspective (2025)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- Reward Model Learning Vs. Direct Policy Optimization: A Comparative Analysis Of Learning From Human Preferences (2024)
- Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference (2024)
- Dataset Reset Policy Optimization For RLHF (2024)
- Provably Mitigating Overoptimization In RLHF: Your SFT Loss Is Implicitly An Adversarial Regularizer (2024)
- SAFE: Stable Alignment Finetuning With Entropy-aware Predictive Control For Reinforcement Learning From Human Feedback (RLHF) (2026)