OptiDICE: Offline Policy Optimization Via Stationary Distribution Correction Estimation
2021 · Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, et al.
Abstract
We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline
Related papers
- COptiDICE: Offline Constrained Reinforcement Learning Via Stationary Distribution Correction Estimation (2022)
- Regularizing A Model-based Policy Stationary Distribution To Stabilize Offline Reinforcement Learning (2022)
- AlberDICE: Addressing Out-of-distribution Joint Actions In Offline Multi-agent RL Via Alternating Stationary Distribution Correction Estimation (2023)
- Bridging Distributionally Robust Learning And Offline RL: An Approach To Mitigate Distribution Shift And Partial Data Coverage (2023)
- Offline Policy Optimization In RL With Variance Regularization (2022)
- LobsDICE: Offline Learning From Observation Via Stationary Distribution Correction Estimation (2022)
- ODICE: Revealing The Mystery Of Distribution Correction Estimation Via Orthogonal-gradient Update (2024)
- Off-policy Policy Gradient Algorithms By Constraining The State Distribution Shift (2019)