DVPO: Distributional Value Modeling-based Policy Optimization For LLM Post-training
2026 · Dingwei Zhu, Zhiheng Xi, Shihan Dou, et al.
Abstract
arXiv:2512.03847v2
Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision, and applies an asymmetric risk regularization to shape the distribution t
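As a rough illustration of what "distributional value modeling" can mean in practice, the sketch below represents one token's value as a small set of quantiles, scores it with the standard quantile-regression (pinball) loss, and compares the plain mean with a lower-tail average in the style of CVaR. This is an assumption-laden toy, not the DVPO objective: the abstract does not specify the paper's parameterization, loss, or risk regularizer, and the names `pinball_loss` and `lower_tail_mean`, the quantile levels, and the numbers are all hypothetical.

```python
import math

def pinball_loss(pred_quantiles, target, taus):
    """Standard pinball loss, averaged over quantile levels.

    Hypothetical helper: one common way to fit a quantile-based value
    distribution, not necessarily the loss used in the paper.
    """
    total = 0.0
    for q, tau in zip(pred_quantiles, taus):
        diff = target - q  # positive when the predicted quantile is too low
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(pred_quantiles)

def lower_tail_mean(pred_quantiles, alpha):
    """Mean of the lowest alpha-fraction of quantiles (a CVaR-style summary)."""
    ordered = sorted(pred_quantiles)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k

# Illustrative values only: four quantile estimates for a single token.
taus = [0.125, 0.375, 0.625, 0.875]
quantiles = [-1.0, 0.0, 1.0, 2.0]

mean_value = sum(quantiles) / len(quantiles)   # 0.5, what a mean-based critic would report
tail_value = lower_tail_mean(quantiles, 0.5)   # -0.5, the pessimistic lower half
loss = pinball_loss(quantiles, 0.0, taus)      # 0.1875 against an observed return of 0.0
```

The gap between `mean_value` and `tail_value` shows the kind of information a scalar value estimate discards: two tokens with the same mean can have very different lower tails, which is what a risk-aware objective would have access to.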
Related papers
- DGPO: Distribution Guided Policy Optimization For Fine Grained Credit Assignment (2026)
- Pretrain Value, Not Reward: Decoupled Value Policy Optimization (2025)
- Adapt To Thrive! Adaptive Power-mean Policy Optimization For Improved LLM Reasoning (2026)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- Uniform-correct Policy Optimization: Breaking RLVR's Indifference To Diversity (2026)
- It's Not You, It's Clipping: A Soft Trust-region Via Probability Smoothing For LLM RL (2025)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)
- Quantile Reward Policy Optimization: Alignment With Pointwise Regression And Exact Partition Functions (2025)