TD3 With Reverse KL Regularizer For Offline Reinforcement Learning From Mixed Datasets
2022 · Yuanying Cai, Chuheng Zhang, Li Zhao, et al.
Abstract
We consider an offline reinforcement learning (RL) setting in which the agent needs to learn from a dataset collected by rolling out multiple behavior policies. This setting poses two challenges: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal varies across states, because the action coverage induced by the different behavior policies varies. Previous methods fail to handle this because they control only a global trade-off. 2) For a given state, the action distribution generated by different behavior policies may have multiple modes. The BC regularizers in many previous methods are mean-seeking, resulting in policies that select out-of-distribution (OOD) actions in the middle of the modes. In this paper, we address both challenges by using adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer based on the TD3 algorithm. Our method not only trades off the RL and BC signals with per-state weights (i.e.
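The contrast the abstract draws between mean-seeking and mode-seeking regularizers can be illustrated with a small numerical sketch. This is not the paper's algorithm: it is a one-dimensional toy, assuming a hypothetical bimodal "behavior" action distribution with modes at -2 and 2 and a single Gaussian policy of fixed width 0.5, with both KL directions evaluated by numerical integration on a grid. All names and constants here are illustrative assumptions.

```python
import numpy as np

def log_normal(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Integration grid over the (hypothetical) 1-D action space.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]

# Assumed behavior distribution: equal mixture of two narrow Gaussians,
# standing in for two behavior policies that prefer different actions.
log_p = np.logaddexp(
    np.log(0.5) + log_normal(x, -2.0, 0.5),
    np.log(0.5) + log_normal(x, 2.0, 0.5),
)

def kl(log_a, log_b):
    """Riemann-sum approximation of KL(a || b) from log-densities on the grid."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

# Candidate policy means; the policy is N(mu, 0.5^2) in every case.
candidates = np.linspace(-3.0, 3.0, 61)
forward = [kl(log_p, log_normal(x, m, 0.5)) for m in candidates]  # KL(data || policy)
reverse = [kl(log_normal(x, m, 0.5), log_p) for m in candidates]  # KL(policy || data)

mu_forward = float(candidates[int(np.argmin(forward))])
mu_reverse = float(candidates[int(np.argmin(reverse))])
print("forward-KL minimizer:", mu_forward)
print("reverse-KL minimizer:", mu_reverse)
```

In this toy, the forward-KL minimizer sits between the two modes, where the assumed data distribution has almost no mass (an OOD action), while the reverse-KL minimizer sits on one of the modes. The per-state adaptive weighting described in the abstract is not modeled here.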
Related papers
- Improving TD3-BC: Relaxed Policy Constraint For Offline Learning And Stable Online Fine-tuning (2022) · 0.00
- B3C: A Minimalist Approach To Offline Multi-agent Reinforcement Learning (2025) · 0.00
- BRAC+: Improved Behavior Regularized Actor Critic For Offline Reinforcement Learning (2021) · 0.00
- Iteratively Refined Behavior Regularization For Offline Reinforcement Learning (2023) · 2.26
- Policy Regularization With Dataset Constraint For Offline Reinforcement Learning (2023) · 0.00
- Adaptive Behavior Cloning Regularization For Stable Offline-to-online Reinforcement Learning (2022) · 8.09
- Regularizing A Model-based Policy Stationary Distribution To Stabilize Offline Reinforcement Learning (2022) · 0.00
- Beyond Uniform Sampling: Offline Reinforcement Learning With Imbalanced Datasets (2023) · 2.83