Breaking The Bias Barrier In Concave Multi-objective Reinforcement Learning

Abstract

While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility \(f(J_1^\pi,\dots,J_M^\pi)\) over multiple objectives, where each \(J_m^\pi\) denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy gradient methods: the gradient depends on \(\partial f(J^\pi)\), while in practice only empirical return estimates \(\hat J\) are available. Because \(f\) is nonlinear, the plug-in estimator is biased (\(\mathbb{E}[\partial f(\hat J)] \neq \partial f(\mathbb{E}[\hat J])\)), leading to persistent gradient bias that degrades sample complexity. In this work, we identify and overcome this bias barrier in concave-scalarized multi-objective reinforcement learning. We show that existing policy-gradient methods suffer an i
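The plug-in bias described in the abstract can be illustrated numerically. The sketch below is not the paper's method; it is a minimal Monte Carlo check under assumptions chosen purely for illustration: a concave utility \(f(J) = \sum_m \log J_m\), two objectives with hypothetical true returns 2.0 and 4.0, and unbiased return samples with uniform noise in \([-1, 1]\). The names `TRUE_RETURNS`, `grad_f`, and `plug_in_bias` are invented for this example.

```python
import random

# Hypothetical true expected returns J_1, J_2 (illustrative values only).
TRUE_RETURNS = [2.0, 4.0]


def grad_f(J):
    """Gradient of the assumed concave utility f(J) = sum_m log(J_m)."""
    return [1.0 / j for j in J]


def plug_in_bias(n_samples, n_trials=20000, seed=0):
    """Monte Carlo estimate of E[grad f(J_hat)] - grad f(J), per objective.

    J_hat is the average of n_samples unbiased noisy return samples,
    so E[J_hat] = J and any nonzero result is due to f being nonlinear.
    """
    rng = random.Random(seed)
    exact = grad_f(TRUE_RETURNS)
    totals = [0.0] * len(TRUE_RETURNS)
    for _ in range(n_trials):
        # Empirical return estimate for each objective (always >= 1 here).
        J_hat = [
            sum(j + rng.uniform(-1.0, 1.0) for _ in range(n_samples)) / n_samples
            for j in TRUE_RETURNS
        ]
        # Accumulate the plug-in gradient evaluated at the noisy estimate.
        for m, g in enumerate(grad_f(J_hat)):
            totals[m] += g
    return [totals[m] / n_trials - exact[m] for m in range(len(exact))]


if __name__ == "__main__":
    for n in (4, 64):
        print(n, plug_in_bias(n))
```

In this setup the bias is positive for every objective, because \(1/x\) is convex on the positive reals (Jensen's inequality), and it shrinks roughly in proportion to \(1/n\) as more samples are averaged into \(\hat J\). The bias vanishes only in the limit, which is the sense in which the plug-in gradient stays biased at any finite sample size.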
