EMaQ: Expected-Max Q-Learning Operator For Simple Yet Effective Offline And Online RL
2020 · Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, Shixiang Shane Gu
Abstract
Off-policy reinforcement learning holds the promise of sample-efficient learning of decision-making policies by leveraging past experience. However, in the offline RL setting -- where a fixed collection of interactions is provided and no further interactions are allowed -- it has been shown that standard off-policy RL methods can significantly underperform. Recently proposed methods often aim to address this shortcoming by constraining learned policies to remain close to the given dataset of interactions. In this work, we closely investigate an important simplification of BCQ -- a prior approach for offline RL -- which removes a heuristic design choice and naturally restricts extracted policies to remain exactly within the support of a given behavior policy. Importantly, in contrast to their original theoretical considerations, we derive this simplified algorithm through the introduction of a novel backup operator, Expected-Max Q-Learning (EMaQ), which is more closely related to the r
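The abstract names the backup operator but does not spell out its form. The sketch below is one reading of "expected-max", under the assumption that the target is the reward plus the discounted expectation of the maximum Q-value over N actions sampled from the behavior policy at the next state. The tabular Q-table, the function name `expected_max_target`, the sampler `sample_behavior_action`, and all parameter names and defaults are hypothetical illustrations, not the authors' implementation.

```python
import random


def expected_max_target(q, sample_behavior_action, reward, next_state,
                        n=5, gamma=0.99, num_estimates=100, rng=None):
    """Monte Carlo estimate of r + gamma * E[max_i Q(s', a_i)], a_i ~ mu(.|s').

    q: dict mapping (state, action) -> value (hypothetical tabular Q-function).
    sample_behavior_action: callable (state, rng) -> action, standing in for
        a sampler of the behavior policy mu (an assumption of this sketch).
    n: number of behavior-policy actions drawn per max.
    num_estimates: number of independent max estimates that are averaged.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_estimates):
        # Draw n candidate actions from the behavior policy at the next state.
        actions = [sample_behavior_action(next_state, rng) for _ in range(n)]
        # The max only ranges over sampled actions, so it can never select
        # an action outside the behavior policy's support.
        total += max(q[(next_state, a)] for a in actions)
    return reward + gamma * total / num_estimates


# Toy usage: action 1 has a large Q-value but the behavior policy never takes it.
q_table = {("s1", 0): 1.0, ("s1", 1): 10.0}
always_zero = lambda state, rng: 0
target = expected_max_target(q_table, always_zero, reward=0.5,
                             next_state="s1", n=5, gamma=0.9)
```

In this toy case the out-of-support action never enters the max, which illustrates the abstract's point about staying within the support of the behavior policy. Under this reading, n = 1 reduces to an ordinary expectation under the behavior policy, and larger n moves the target toward the maximum over in-support actions.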
Related papers
- Mildly Conservative Q-learning For Offline Reinforcement Learning (2022)
- Expert-supervised Reinforcement Learning For Offline Policy Learning And Evaluation (2020)
- Constraints Penalized Q-learning For Safe Offline Reinforcement Learning (2021)
- ACL-QL: Adaptive Conservative Level In Q-learning For Offline Reinforcement Learning (2024)
- Cal-QL: Calibrated Offline RL Pre-training For Efficient Online Fine-tuning (2023)
- Deployment-efficient Reinforcement Learning Via Model-based Offline Optimization (2020)
- MOReL: Model-based Offline Reinforcement Learning (2020)
- Model-based Offline Reinforcement Learning With Lower Expectile Q-learning (2024)