Beyond Variance Reduction: Understanding The True Impact Of Baselines On Policy Optimization
2020 · Wesley Chung, Valentin Thomas, Marlos C. Machado, et al.
Abstract
Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that this is not the case for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. Thes
Related papers
- Variance Reduction For Policy Gradient With Action-dependent Factorized Baselines (2018)
- Meta-learning Bandit Policies By Gradient Ascent (2020)
- Off-policy Evaluation And Learning From Logged Bandit Feedback: Error Reduction Via Surrogate Policy (2018)
- Where Did My Optimum Go?: An Empirical Analysis Of Gradient Descent Optimization In Policy Gradient Methods (2018)
- Behaviour Policy Optimization: Provably Lower Variance Return Estimates For Off-policy Reinforcement Learning (2025)
- Variance Reduction For Score Functions Using Optimal Baselines (2022)
- Conservative Exploration For Policy Optimization Via Off-policy Policy Evaluation (2023)
- Conservative Optimistic Policy Optimization Via Multiple Importance Sampling (2021)