Breaking The Bias Barrier In Concave Multi-objective Reinforcement Learning
2026 · Swetha Ganesh, Vaneet Aggarwal
Abstract
While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility \(f(J_1^\pi,\dots,J_M^\pi)\) over multiple objectives, where each \(J_m^\pi\) denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy gradient methods: the gradient depends on \(\partial f(J^\pi)\), while in practice only empirical return estimates \(\hat J\) are available. Because \(f\) is nonlinear, the plug-in estimator is biased (\(\mathbb{E}[\partial f(\hat J)] \neq \partial f(\mathbb{E}[\hat J])\)), leading to persistent gradient bias that degrades sample complexity. In this work we identify and overcome this bias barrier in concave-scalarized multi-objective reinforcement learning. We show that existing policy-gradient methods suffer an i
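The plug-in bias described in the abstract can be illustrated numerically. The sketch below is not the paper's method or experimental setup. It assumes a hypothetical concave utility \(f(J) = \sum_m \log J_m\) (a proportional-fairness-style scalarization chosen only for illustration), hypothetical true returns, and uniform sampling noise. It estimates \(\mathbb{E}[\partial f(\hat J)] - \partial f(J)\) by Monte Carlo for several sample sizes.

```python
import numpy as np

def plug_in_gradient_bias(n_samples, n_trials=20000, seed=0):
    """Monte Carlo estimate of E[df(J_hat)] - df(J) for f(J) = sum_m log(J_m).

    All quantities here are illustrative assumptions, not the paper's setup.
    """
    rng = np.random.default_rng(seed)
    true_J = np.array([1.0, 2.0])  # hypothetical true returns J_1, J_2
    # Hypothetical zero-mean noise; its range keeps every sample positive.
    noise = rng.uniform(-0.5, 0.5, size=(n_trials, n_samples, true_J.size))
    # Empirical return estimate per trial: unbiased for true_J.
    J_hat = (true_J + noise).mean(axis=1)
    # Plug-in gradient of sum_m log(J_m) is 1 / J_hat, elementwise.
    grad_hat = 1.0 / J_hat
    # Average plug-in gradient minus the true gradient 1 / true_J.
    return grad_hat.mean(axis=0) - 1.0 / true_J

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, plug_in_gradient_bias(n))
```

Under these assumptions the estimated bias is positive for both objectives, as Jensen's inequality predicts for the convex map \(J \mapsto 1/J\), and it shrinks as more samples are averaged into \(\hat J\). Even though \(\hat J\) itself is unbiased, the gradient term computed from it is not.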
Related papers
- Joint Optimization Of Multi-objective Reinforcement Learning With Policy Gradient Based Algorithm (2021)
- On The Second-order Convergence Of Biased Policy Gradient Algorithms (2023)
- Natural Policy Gradient And Actor Critic Methods For Constrained Multi-task Reinforcement Learning (2024)
- Bias Resilient Multi-step Off-policy Goal-conditioned Reinforcement Learning (2023)
- A Nearly Blackwell-optimal Policy Gradient Method (2021)
- A Generalized Algorithm For Multi-objective Reinforcement Learning And Policy Adaptation (2019)
- Global Reinforcement Learning: Beyond Linear And Convex Rewards Via Submodular Semi-gradient Methods (2024)
- Multi-objective Reinforcement Learning With Nonlinear Preferences: Provable Approximation For Maximizing Expected Scalarized Return (2023)