Efficient Offline Reinforcement Learning: First Imitate, Then Improve
2024 · Adam Jelley, Trevor McInroe, Sam Devlin, et al.
Abstract
Supervised imitation-based approaches are often favored over off-policy reinforcement learning approaches for learning policies offline, since their straightforward optimization objective makes them computationally efficient and stable to train. However, their performance is fundamentally limited by the behavior policy that collected the dataset. Off-policy reinforcement learning provides a promising approach for improving on the behavior policy, but training is often computationally inefficient and unstable due to temporal-difference bootstrapping. In this paper, we propose a best-of-both approach by pre-training with supervised learning before improving performance with off-policy reinforcement learning. Specifically, we demonstrate improved efficiency by pre-training an actor with behavior cloning and a critic with a supervised Monte-Carlo value error. We find that we are able to substantially improve the training time of popular off-policy algorithms on standard benchmarks, and als
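The two supervised pre-training signals named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean-squared-error form of both losses, the deterministic continuous-action setting, and the assumption that each trajectory ends in a true terminal state (so no bootstrapping is needed at the end) are all assumptions made here for illustration.

```python
# Illustrative sketch only; names and loss forms are assumptions, not the paper's code.

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return-to-go for each step of one complete episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_value_error(predicted_values, returns):
    """Mean squared error between critic predictions and Monte-Carlo returns."""
    squared = [(v - g) ** 2 for v, g in zip(predicted_values, returns)]
    return sum(squared) / len(squared)


def behavior_cloning_loss(policy_actions, dataset_actions):
    """Mean squared error between policy actions and dataset actions.

    Assumes deterministic, continuous action vectors of equal length.
    """
    total, count = 0.0, 0
    for predicted, target in zip(policy_actions, dataset_actions):
        for p, a in zip(predicted, target):
            total += (p - a) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    targets = monte_carlo_returns([1.0, 1.0, 1.0], gamma=0.5)
    print(targets)
    print(mc_value_error([0.0, 0.0, 0.0], targets))
    print(behavior_cloning_loss([[0.0, 0.0]], [[1.0, 1.0]]))
```

Both losses are plain regression objectives with fixed targets taken from the dataset, which is the property the abstract contrasts with temporal-difference bootstrapping, where the targets depend on the critic being trained.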
Related papers
- Offline-boosted Actor-critic: Adaptively Blending Optimal Historical Behaviors In Deep Off-policy RL (2024)
- Curriculum Offline Imitation Learning (2021)
- A Policy-guided Imitation Approach For Offline Reinforcement Learning (2022)
- Finetuning From Offline Reinforcement Learning: Challenges, Trade-offs And Practical Solutions (2023)
- Adaptive Behavior Cloning Regularization For Stable Offline-to-online Reinforcement Learning (2022)
- When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning? (2022)
- Bridging Offline Reinforcement Learning And Imitation Learning: A Tale Of Pessimism (2021)
- Beyond Uniform Sampling: Offline Reinforcement Learning With Imbalanced Datasets (2023)