When Policies Cannot Be Retrained: A Unified Closed-form View Of Post-training Steering In Offline Reinforcement Learning
2026 · Elias Hossain, Mohammad Jahid Ibna Basher, Ivan Garibay, et al.
Abstract
arXiv:2604.22873v1. Offline reinforcement learning (RL) can learn effective policies from fixed datasets, but deployment objectives may change after training, and in many applications the trained actor cannot be retrained because of data, cost, or governance constraints. We study deployment-time adaptation for frozen offline actors using Product-of-Experts (PoE) composition with a goal-conditioned prior. Our main practical finding is graceful degradation rather than universal performance gain: under degraded or random priors, precision-weighted composition remains anchored to the frozen actor, while additive and prior-only adaptation collapse, and a KL-budget selector often recovers a near-oracle operating point. We also make explicit a closed-form identity in the frozen-actor setting: for diagonal-Gaussian actors and priors, PoE with coefficient alpha yields the same deterministic policy as KL-regularized adaptation with beta = alpha / (1 - alpha), with
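The closed-form identity stated in the abstract can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes the PoE policy is the weighted product actor^(1 - alpha) * prior^alpha, that the KL-regularized policy is proportional to actor * prior^beta, and that the "deterministic policy" is the mean of the resulting Gaussian. The function names `poe_mean` and `kl_regularized_mean` are hypothetical, and the abstract does not confirm these exact conventions.

```python
import math

def poe_mean(mu_a, var_a, mu_p, var_p, alpha):
    """Per-dimension mean of actor^(1 - alpha) * prior^alpha (assumed PoE form)."""
    means = []
    for ma, va, mp, vp in zip(mu_a, var_a, mu_p, var_p):
        w_actor = (1.0 - alpha) / va   # precision contributed by the frozen actor
        w_prior = alpha / vp           # precision contributed by the prior
        means.append((w_actor * ma + w_prior * mp) / (w_actor + w_prior))
    return means

def kl_regularized_mean(mu_a, var_a, mu_p, var_p, beta):
    """Per-dimension mean of actor * prior^beta (assumed KL-regularized form)."""
    means = []
    for ma, va, mp, vp in zip(mu_a, var_a, mu_p, var_p):
        w_actor = 1.0 / va
        w_prior = beta / vp
        means.append((w_actor * ma + w_prior * mp) / (w_actor + w_prior))
    return means

# Illustrative values only: a 2-D action with diagonal covariances.
mu_actor, var_actor = [0.5, -1.0], [0.04, 0.25]
mu_prior, var_prior = [1.5, 0.0], [0.16, 0.09]

alpha = 0.3
beta = alpha / (1.0 - alpha)  # the mapping quoted in the abstract

a_poe = poe_mean(mu_actor, var_actor, mu_prior, var_prior, alpha)
a_kl = kl_regularized_mean(mu_actor, var_actor, mu_prior, var_prior, beta)
same_action = all(math.isclose(x, y, rel_tol=1e-12) for x, y in zip(a_poe, a_kl))
```

Under these assumptions the two means coincide because scaling both precision weights by 1 / (1 - alpha) leaves their ratio unchanged. The variances of the two distributions generally differ, so the match concerns only the deterministic (mean) action. Setting alpha = 0 returns the frozen actor's mean in both forms.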
Related papers
- Adaptive Policy Selection And Fine-tuning Under Interaction Budgets For Offline-to-online Reinforcement Learning (2026)
- Policy Agnostic RL: Offline RL And Online RL Fine-tuning Of Any Class And Backbone (2024)
- Offline Retraining For Online RL: Decoupled Policy Learning To Mitigate Exploration Bias (2023)
- PROTO: Iterative Policy Regularized Offline-to-online Reinforcement Learning (2023)
- Finetuning From Offline Reinforcement Learning: Challenges, Trade-offs And Practical Solutions (2023)
- Regularizing A Model-based Policy Stationary Distribution To Stabilize Offline Reinforcement Learning (2022)
- Learning A Subspace Of Policies For Online Adaptation In Reinforcement Learning (2021)
- POPO: Pessimistic Offline Policy Optimization (2020)