Adaptive Policy Selection And Fine-tuning Under Interaction Budgets For Offline-to-online Reinforcement Learning
2026 · Alper Kamil Bozkurt, Xiaoan Xu, Shangtong Zhang, et al.
Abstract
arXiv:2605.05123v1
In offline-to-online reinforcement learning (O2O-RL), policies are first safely trained offline using previously collected datasets and then further fine-tuned for tasks via limited online interactions. In a typical O2O-RL pipeline, candidate policies trained with offline RL are evaluated via either off-policy evaluation (OPE) or online evaluation (OE). The policy with the highest estimated value is then deployed and continually fine-tuned. However, this setup has two main issues. First, OPE can be unreliable, making it risky to deploy a policy based solely on those estimates, whereas OE may identify a viable policy only at the cost of substantial online interaction, which could otherwise have been used for fine-tuning. Second, and more importantly, it is often not possible to determine a priori whether a pretrained policy will improve with post-deployment fine-tuning, especially in non-stationary environments. As a result, procedures committing to a single
Related papers
- Optimistic Critic Reconstruction And Constrained Fine-tuning For General Offline-to-online RL (2024)
- Towards Fast Safe Online Reinforcement Learning Via Policy Finetuning (2024)
- Offline Retraining For Online RL: Decoupled Policy Learning To Mitigate Exploration Bias (2023)
- Planning To Go Out-of-distribution In Offline-to-online Reinforcement Learning (2023)
- Towards Robust Offline-to-online Reinforcement Learning Via Uncertainty And Smoothness (2023)
- Finetuning From Offline Reinforcement Learning: Challenges, Trade-offs And Practical Solutions (2023)
- A Perspective Of Q-value Estimation On Offline-to-online Reinforcement Learning (2023)
- Adaptive Replay Buffer For Offline-to-online Reinforcement Learning (2025)