Policy Agnostic RL: Offline RL And Online RL Fine-tuning Of Any Class And Backbone
2024 · Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, et al.
Abstract
Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, largely via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes, with varying architectures and sizes. We build off the basic idea that a universal supervised learning loss can replace t
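The reparameterized policy gradient that the abstract attributes to SAC for Gaussian policies can be illustrated with a toy example. The sketch below is not PA-RL and not the authors' code. It assumes a one-dimensional Gaussian policy and a hypothetical quadratic critic (`q_value`, `dq_daction`, and `target` are illustrative names), and estimates the gradient of the expected Q-value with respect to the policy mean and standard deviation by differentiating through the sampled action.

```python
import random


def q_value(action, target=1.0):
    # Hypothetical toy critic: the value peaks when the action equals `target`.
    return -(action - target) ** 2


def dq_daction(action, target=1.0):
    # Analytic derivative of the toy critic with respect to the action.
    return -2.0 * (action - target)


def reparam_gradient(mu, sigma, n_samples=10000, seed=0):
    """Monte Carlo estimate of d E[Q(a)] / d(mu, sigma) for a ~ N(mu, sigma^2).

    Reparameterization: a = mu + sigma * eps with eps ~ N(0, 1), so the
    chain rule gives da/dmu = 1 and da/dsigma = eps.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        action = mu + sigma * eps
        dq = dq_daction(action)
        grad_mu += dq          # dQ/da * da/dmu
        grad_sigma += dq * eps  # dQ/da * da/dsigma
    return grad_mu / n_samples, grad_sigma / n_samples


if __name__ == "__main__":
    g_mu, g_sigma = reparam_gradient(mu=0.0, sigma=0.5)
    # For this critic, E[Q] = -((mu - target)^2 + sigma^2), so the exact
    # gradients are -2 * (mu - target) and -2 * sigma.
    print(g_mu, g_sigma)
```

The estimator relies on the action being a differentiable function of the policy parameters and a noise sample. That holds for a Gaussian, but there is no such direct pathway for a categorical token sampled from an autoregressive policy, which is the kind of mismatch the abstract points to.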
Related papers
- Finetuning From Offline Reinforcement Learning: Challenges, Trade-offs And Practical Solutions (2023)
- Learning A Subspace Of Policies For Online Adaptation In Reinforcement Learning (2021)
- Offline Retraining For Online RL: Decoupled Policy Learning To Mitigate Exploration Bias (2023)
- A Non-monolithic Policy Approach Of Offline-to-online Reinforcement Learning (2024)
- Active Advantage-aligned Online Reinforcement Learning With Offline Data (2025)
- Adaptive Policy Selection And Fine-tuning Under Interaction Budgets For Offline-to-online Reinforcement Learning (2026)
- AWAC: Accelerating Online Reinforcement Learning With Offline Datasets (2020)
- Towards Fast Safe Online Reinforcement Learning Via Policy Finetuning (2024)