POPO: Pessimistic Offline Policy Optimization
2020 · Qiang He, Xinwen Hou
Abstract
Offline reinforcement learning (RL), also known as batch RL, aims to optimize a policy from a large pre-recorded dataset without interacting with the environment. This setting offers the promise of utilizing diverse, pre-collected datasets to obtain policies without costly, risky, active exploration. However, commonly used off-policy algorithms based on Q-learning or actor-critic perform poorly when learning from a static dataset. In this work, we study why off-policy RL methods fail to learn in the offline setting from the value function view, and we propose a novel offline RL algorithm that we call Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy. We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces, performing comparably to or better than several state-of-the-art offline RL algorithms on benchmark tasks.
Related papers
- Is Pessimism Provably Efficient For Offline RL? (2020)
- Projected Off-policy Q-learning (POP-QL) For Stabilizing Offline Reinforcement Learning (2023)
- Pessimistic Bootstrapping For Uncertainty-driven Offline Reinforcement Learning (2022)
- Pessimism In The Face Of Confounders: Provably Efficient Offline Reinforcement Learning In Partially Observable Markov Decision Processes (2022)
- State-aware Proximal Pessimistic Algorithms For Offline Reinforcement Learning (2022)
- Pessimistic Nonlinear Least-squares Value Iteration For Offline Reinforcement Learning (2023)
- MOReL: Model-based Offline Reinforcement Learning (2020)
- Conservative Bayesian Model-based Value Expansion For Offline Policy Optimization (2022)