Pretrain Value, Not Reward: Decoupled Value Policy Optimization
2025 · Chenghua Huang, Lu Wang, Fangkai Yang, et al.
Abstract
In this paper, we explore how directly pretraining a value model simplifies and stabilizes reinforcement learning from human feedback (RLHF). In reinforcement learning, value estimation is the key to policy optimization, distinct from reward supervision. The value function predicts the *return-to-go* of a partial answer, that is, how promising the partial answer is if it were continued to completion. In RLHF, however, the standard pipeline first pretrains a reward model and then learns a value function online, even though no new reward signals are available once preference data is collected. This makes critic learning redundant, as the process of training a reward model and then deriving a value model is informationally equivalent to directly pretraining a value model. Importantly, this requires no additional supervision, and our value model is trained on exactly the same data used for reward modeling. Building on this insight, we introduce *Decoupled Value Policy Optimization* (DVPO),
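To make the return-to-go definition concrete, here is a minimal Python sketch. It assumes a sparse reward placed only on the final token and no discounting (gamma = 1), a common RLHF setup that the abstract does not itself specify. The function names `return_to_go` and `prefix_value_estimate` are hypothetical, and this is an illustration of the quantity a value model predicts, not DVPO's training procedure.

```python
from typing import List


def return_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every position t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the next position.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def prefix_value_estimate(final_rewards: List[float]) -> float:
    """Monte Carlo estimate of a prefix's value: the mean final reward
    over several sampled completions of that same prefix."""
    return sum(final_rewards) / len(final_rewards)


# Assumption: only the last token of the answer carries a reward.
token_rewards = [0.0, 0.0, 0.0, 1.0]
per_prefix_return = return_to_go(token_rewards)

# Assumption: four sampled completions of one prefix, two judged good.
value_of_prefix = prefix_value_estimate([1.0, 0.0, 1.0, 0.0])
```

Under these assumptions, every prefix of a single sampled answer has the same return-to-go as the final reward, so the value of a partial answer is the expected final reward over its possible completions.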