Off-policy Policy Gradient Algorithms By Constraining The State Distribution Shift
2019 · Riashat Islam, Komal K. Teru, Deepak Sharma, et al.
Abstract
Off-policy deep reinforcement learning (RL) algorithms are incapable of learning solely from batch offline data without online interactions with the environment, due to the phenomenon known as extrapolation error. This is often due to past data available in the replay buffer that may be quite different from the data distribution under the current policy. We argue that most off-policy learning methods fundamentally suffer from a state distribution shift due to the mismatch between the state visitation distributions of the data collected by the behavior and target policies. This data distribution shift between current and past samples can significantly impact the performance of most modern off-policy policy optimization algorithms. In this work, we first conduct a systematic analysis of state distribution mismatch in off-policy learning, and then develop a novel off-policy policy optimization method to constrain the state distribution shift. To do this, we first es
Related papers
- Off-policy Policy Gradient With State Distribution Correction (2019)
- Mitigating Distribution Shift In Model-based Offline RL Via Shifts-aware Reward Learning (2024)
- Bridging Distributionally Robust Learning And Offline RL: An Approach To Mitigate Distribution Shift And Partial Data Coverage (2023)
- OptiDICE: Offline Policy Optimization Via Stationary Distribution Correction Estimation (2021)
- Regularizing A Model-based Policy Stationary Distribution To Stabilize Offline Reinforcement Learning (2022)
- Interpolated Policy Gradient: Merging On-policy And Off-policy Gradient Estimation For Deep Reinforcement Learning (2017)
- State-constrained Offline Reinforcement Learning (2024)
- On-policy Policy Gradient Reinforcement Learning Without On-policy Sampling (2023)