GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values
2020 · Shangtong Zhang, Bo Liu, Shimon Whiteson
Abstract
We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. GradientDICE fixes several problems of GenDICE (Zhang et al., 2020), the state-of-the-art for estimating such density ratios. Namely, the optimization problem in GenDICE is not a convex-concave saddle-point problem once nonlinearity is introduced in the parameterization of the optimization variable to ensure positivity, so no primal-dual algorithm is guaranteed to converge or to find the desired solution. However, such nonlinearity is essential to ensure the consistency of GenDICE even with a tabular representation. This is a fundamental contradiction, resulting from GenDICE's original formulation of the optimization problem. In GradientDICE, we optimize a different objective from GenDICE by using the Perron-Frobenius theorem and eliminating GenDICE's use of divergence. Consequently, nonlinearity in parameterization is n
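To make the estimated quantity concrete, here is a minimal tabular sketch of the density ratio the abstract refers to. It is not the GradientDICE algorithm: it assumes the transition matrix is known and recovers the target policy's stationary state distribution by power iteration, which the Perron-Frobenius theorem justifies for an irreducible, aperiodic chain. The 3-state matrix `P_pi`, the sampling distribution `d_mu`, and the helper `stationary_distribution` are hypothetical and chosen only for illustration; the undiscounted setting is also an assumption.

```python
import numpy as np

# Hypothetical 3-state Markov chain induced by the target policy
# (each row is a next-state distribution and sums to 1).
P_pi = np.array([
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
])

# Hypothetical sampling (behavior) distribution over states.
d_mu = np.array([0.5, 0.3, 0.2])


def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration for the left eigenvector of P with eigenvalue 1."""
    d = np.zeros(P.shape[0])
    d[0] = 1.0  # start from a point mass; any distribution works here
    for _ in range(iters):
        d_next = d @ P
        if np.abs(d_next - d).max() < tol:
            return d_next
        d = d_next
    return d


d_pi = stationary_distribution(P_pi)

# Density ratio between the target policy's stationary distribution
# and the sampling distribution, one entry per state.
tau = d_pi / d_mu

# Two properties the true ratio satisfies in this tabular setting:
# it is normalized under d_mu, and d_mu * tau is invariant under P_pi.
normalization = float(np.sum(d_mu * tau))
fixed_point_gap = float(np.abs((d_mu * tau) @ P_pi - d_mu * tau).max())
```

In the off-policy setting the transition matrix is unknown and only samples drawn from the sampling distribution are available, which is why methods such as GenDICE and GradientDICE estimate the ratio through an optimization problem rather than by direct computation as in this sketch.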
Related papers
- Diffusion-DICE: In-sample Diffusion Guidance for Offline Reinforcement Learning (2024)
- DualDICE: Behavior-agnostic Estimation of Discounted Stationary Distribution Corrections (2019)
- ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update (2024)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation (2021)
- Off-policy Policy Gradient with State Distribution Correction (2019)
- Doubly Robust Off-policy Value and Gradient Estimation for Deterministic Policies (2020)
- Loaded DiCE: Trading Off Bias and Variance in Any-order Score Function Estimators for Reinforcement Learning (2019)
- SoftDICE for Imitation Learning: Rethinking Off-policy Distribution Matching (2021)