Optimistic Critic Reconstruction And Constrained Fine-tuning For General Offline-to-online RL
2024 Β· Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, et al.
Abstract
Offline-to-online (O2O) reinforcement learning (RL) provides an effective means of leveraging an offline pre-trained policy as initialization to improve performance rapidly with limited online interactions. Recent studies often design fine-tuning strategies for a specific offline RL method and cannot perform general O2O learning from any offline method. To deal with this problem, we disclose that there are evaluation and improvement mismatches between the offline dataset and the online environment, which hinders the direct application of pre-trained policies to online fine-tuning. In this paper, we propose to handle these two mismatches simultaneously, which aims to achieve general O2O learning from any offline method to any online method. Before online fine-tuning, we re-evaluate the pessimistic critic trained on the offline dataset in an optimistic way and then calibrate the misaligned critic with the reliable offline actor to avoid erroneous update. After obtaining an optimistic and
Authors
(none)
Tags
Stats
Related papers
- Offline Retraining For Online RL: Decoupled Policy Learning To Mitigate Exploration Bias (2023)2.56
- Adaptive Policy Selection And Fine-tuning Under Interaction Budgets For Offline-to-online Reinforcement Learning (2026)0.00
- A Perspective Of Q-value Estimation On Offline-to-online Reinforcement Learning (2023)7.81
- Planning To Go Out-of-distribution In Offline-to-online Reinforcement Learning (2023)0.00
- Towards Fast Safe Online Reinforcement Learning Via Policy Finetuning (2024)0.00
- Towards Robust Offline-to-online Reinforcement Learning Via Uncertainty And Smoothness (2023)5.24
- Finetuning From Offline Reinforcement Learning: Challenges, Trade-offs And Practical Solutions (2023)0.00
- Leveraging Offline Data In Online Reinforcement Learning (2022)0.00