Abstract

While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility \(f(J_1^\pi,\dots,J_M^\pi)\) over multiple objectives, where each \(J_m^\pi\) denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy gradient methods: the gradient depends on \(\partial f(J^\pi)\), while in practice only empirical return estimates \(\hat J\) are available. Because \(f\) is nonlinear, the plug-in estimator is biased (\(\mathbb\{E\}[\partial f(\hat J)] \neq \partial f(\mathbb\{E\}[\hat J])\)), leading to persistent gradient bias that degrades sample complexity. In this work we identify and overcome this bias barrier in concave-scalarized multi-objective reinforcement learning. We show that existing policy-gradient methods suffer an i

Authors

(none)

Tags

  • Policy Gradient

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyganesh2026breaking

Related papers