Quantile Reward Policy Optimization: Alignment With Pointwise Regression And Exact Partition Functions
2025 · Simon Matrenok, Skander Moalla, Caglar Gulcehre
Abstract
Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are limited to learning from preference pairs or relative signals. To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards while preserving the simplicity and offline applicability of DPO-like methods. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. This reward yields an analytically tractable partition function, removing the need for relative signals to cancel this term. Moreover, QRPO scales with increased compute to estimate quantile rewards, opening a new dimension for pre-computation scaling. Empirically, QRPO consistently achieves top performance on chat and coding evaluations--reward model scores, AlpacaEval 2, and LeetCode
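The abstract's two key ingredients, a quantile reward and an analytically tractable partition function, can be illustrated with a minimal sketch. It assumes the quantile reward of a completion is the fraction of reference-policy completions whose raw reward does not exceed it, so that for continuous rewards it is uniform on [0, 1] under the reference policy. Under that assumption the partition function of the KL-regularized objective is Z = β(e^{1/β} − 1). The function names, the empirical quantile estimator, and the exact squared-error form are illustrative assumptions, not the authors' implementation.

```python
import math


def quantile_reward(reward, ref_rewards):
    """Empirical quantile of `reward` among rewards of reference-policy samples.

    Assumption: `ref_rewards` holds raw rewards of completions sampled from the
    reference policy for the same prompt.
    """
    return sum(r <= reward for r in ref_rewards) / len(ref_rewards)


def log_partition(beta):
    """log Z when the reward is Uniform(0, 1) under the reference policy.

    Z = E[exp(u / beta)] = integral_0^1 exp(u / beta) du = beta * (exp(1 / beta) - 1).
    Written in a form that stays stable for small beta.
    """
    return math.log(beta) + 1.0 / beta + math.log1p(-math.exp(-1.0 / beta))


def pointwise_loss(q_reward, logp_policy, logp_ref, beta):
    """Squared error between the quantile reward and the reward implied by the policy.

    The closed-form optimum satisfies
        R = beta * log Z + beta * log(pi / pi_ref),
    so a single (prompt, completion, reward) sample gives a regression target.
    """
    implied = beta * log_partition(beta) + beta * (logp_policy - logp_ref)
    return (q_reward - implied) ** 2


# Hypothetical numbers for one completion.
beta = 0.5
q = quantile_reward(2.0, [1.0, 2.0, 3.0, 4.0])
loss = pointwise_loss(q, logp_policy=-12.0, logp_ref=-12.5, beta=beta)
```

Because log Z is known in closed form here, the loss needs only one completion per prompt; no second completion is required to cancel the partition term. Using more reference samples in `ref_rewards` gives a finer quantile estimate, which is one way to read the abstract's point about spending pre-computation on estimating quantile rewards.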
Related papers
- Quantile-based Deep Reinforcement Learning Using Two-timescale Policy Gradient Algorithms (2023)
- DVPO: Distributional Value Modeling-based Policy Optimization For LLM Post-training (2026)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- DISPO: Enhancing Training Efficiency And Stability In Reinforcement Learning For Large Language Model Mathematical Reasoning (2026)
- Group-relative REINFORCE Is Secretly An Off-policy Algorithm: Demystifying Some Myths About GRPO And Its Friends (2025)
- Quantile Q-learning: Revisiting Offline Extreme Q-learning With Quantile Regression (2025)
- Proximal Policy Optimization With Relative Pearson Divergence (2020)
- Reveal The Mystery Of DPO: The Connection Between DPO And RL Algorithms (2025)