A Perspective Of Q-value Estimation On Offline-to-online Reinforcement Learning
2023 · Yinmin Zhang, Jie Liu, Chuming Li, et al.
Abstract
Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on the balance between the RL objective and pessimism, or on the utilization of offline and online samples. In this paper, from a novel perspective, we systematically study the challenges that remain in O2O RL and identify that the slow performance improvement and the instability of online finetuning stem from the inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that the estimation bias and the inaccurate ranking of Q-values send a misleading signal to the policy update, making standard offline RL algorithms, such as CQL and TD3-BC, ineffective in online finetuning. Based on this observation, we address the problem of Q-value estimation with two techniques: (1) perturbed value update and (2) increased frequency of Q-value upd
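The abstract names the two techniques without spelling them out, so the following is only a minimal sketch under stated assumptions. It assumes that the "perturbed value update" means evaluating the TD target at a target action perturbed by clipped Gaussian noise, and that "increased frequency" means running several Q-function updates per environment step. All function names, parameters, and default values here are hypothetical and are not taken from the paper.

```python
import random


def perturbed_td_target(reward, next_state, done, target_policy, target_q,
                        gamma=0.99, noise_std=0.2, noise_clip=0.5,
                        action_low=-1.0, action_high=1.0):
    """TD target evaluated at a noise-perturbed target action.

    Assumption: scalar actions, and the perturbation is clipped Gaussian
    noise added to the target policy's action (hypothetical choice).
    """
    noise = random.gauss(0.0, noise_std)
    noise = max(-noise_clip, min(noise_clip, noise))      # clip the noise
    action = target_policy(next_state) + noise
    action = max(action_low, min(action_high, action))    # keep action in bounds
    not_done = 0.0 if done else 1.0                       # no bootstrap at episode end
    return reward + gamma * not_done * target_q(next_state, action)


def finetune_step(sample_batch, update_q, update_policy, q_updates_per_step=10):
    """Run several Q-function updates, then one policy update.

    Assumption: "increased frequency" is modeled as a fixed number of
    critic updates per environment step (hypothetical default of 10).
    """
    for _ in range(q_updates_per_step):
        update_q(sample_batch())
    update_policy(sample_batch())
```

In this sketch, `update_q` would regress the Q-function toward `perturbed_td_target` on each sampled batch, so the critic is refreshed several times before the policy consumes its estimates.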
Related papers
- Optimistic Critic Reconstruction And Constrained Fine-tuning For General Offline-to-online RL (2024)
- Towards Robust Offline-to-online Reinforcement Learning Via Uncertainty And Smoothness (2023)
- Adaptive Policy Selection And Fine-tuning Under Interaction Budgets For Offline-to-online Reinforcement Learning (2026)
- Towards Fast Safe Online Reinforcement Learning Via Policy Finetuning (2024)
- Offline Retraining For Online RL: Decoupled Policy Learning To Mitigate Exploration Bias (2023)
- Planning To Go Out-of-distribution In Offline-to-online Reinforcement Learning (2023)
- Diverse Randomized Value Functions: A Provably Pessimistic Approach For Offline Reinforcement Learning (2024)
- Mildly Conservative Q-learning For Offline Reinforcement Learning (2022)