When Policies Cannot Be Retrained: A Unified Closed-form View Of Post-training Steering In Offline Reinforcement Learning

Abstract

arXiv:2604.22873v1 (announce type: cross)

Offline reinforcement learning (RL) can learn effective policies from fixed datasets, but deployment objectives may change after training, and in many applications the trained actor cannot be retrained because of data, cost, or governance constraints. We study deployment-time adaptation for frozen offline actors using Product-of-Experts (PoE) composition with a goal-conditioned prior. Our main practical finding is graceful degradation rather than universal performance gain: under degraded or random priors, precision-weighted composition remains anchored to the frozen actor, while additive and prior-only adaptation collapse, and a KL-budget selector often recovers a near-oracle operating point. We also make explicit a closed-form identity in the frozen-actor setting: for diagonal-Gaussian actors and priors, PoE with coefficient alpha yields the same deterministic policy as KL-regularized adaptation with beta = alpha / (1 - alpha), with
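The closed-form identity stated in the abstract can be checked numerically. The sketch below is illustrative only and rests on assumptions the abstract does not spell out: PoE is taken to be the geometric product actor^(1 - alpha) * prior^alpha, and KL-regularized adaptation is taken to be the maximizer of beta * E[log prior] - KL(pi || actor), which is proportional to actor * prior^beta. The function names `poe_mean` and `kl_regularized_mean` are hypothetical, and the deterministic policy is taken to be the mean of the composed Gaussian.

```python
import numpy as np


def poe_mean(mu_actor, sigma_actor, mu_prior, sigma_prior, alpha):
    """Mean of actor^(1 - alpha) * prior^alpha for diagonal Gaussians.

    Assumed weighting; the abstract does not give the exact form.
    """
    # Precision (inverse variance) of each expert, scaled by its exponent.
    w_actor = (1.0 - alpha) / sigma_actor**2
    w_prior = alpha / sigma_prior**2
    # The product of Gaussians has a precision-weighted mean.
    return (w_actor * mu_actor + w_prior * mu_prior) / (w_actor + w_prior)


def kl_regularized_mean(mu_actor, sigma_actor, mu_prior, sigma_prior, beta):
    """Mean of actor * prior^beta for diagonal Gaussians.

    Assumed to be the solution of max_pi beta * E_pi[log prior] - KL(pi || actor).
    """
    w_actor = 1.0 / sigma_actor**2
    w_prior = beta / sigma_prior**2
    return (w_actor * mu_actor + w_prior * mu_prior) / (w_actor + w_prior)


# Hypothetical 4-dimensional action distributions, for illustration only.
rng = np.random.default_rng(0)
mu_actor, mu_prior = rng.normal(size=4), rng.normal(size=4)
sigma_actor = rng.uniform(0.1, 1.0, size=4)
sigma_prior = rng.uniform(0.1, 1.0, size=4)

alpha = 0.3
beta = alpha / (1.0 - alpha)  # mapping quoted in the abstract

a_poe = poe_mean(mu_actor, sigma_actor, mu_prior, sigma_prior, alpha)
a_kl = kl_regularized_mean(mu_actor, sigma_actor, mu_prior, sigma_prior, beta)
assert np.allclose(a_poe, a_kl)
```

Under these assumptions the two means agree because dividing the PoE weights by (1 - alpha) leaves the precision-weighted average unchanged. The composed variances generally differ, which is consistent with the abstract restricting the claim to the deterministic policy.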
