Model-based Offline Reinforcement Learning With Pessimism-modulated Dynamics Belief
2022 · Kaiyang Guo, Yunfeng Shao, Yanhui Geng
Abstract
Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model. Although the dynamics model is learned by reusing the static dataset, its generalization ability can still promote policy learning if properly utilized. To that end, several works propose to quantify the uncertainty of the predicted dynamics and explicitly apply it to penalize the reward. However, as the dynamics and the reward are intrinsically different factors in the context of an MDP, characterizing the impact of dynamics uncertainty through a reward penalty may incur an unexpected tradeoff between model utilization and risk avoidance. In this work, we instead maintain a belief distribution over dynamics, and evaluate/optimize the policy through biased sampling from the belief. The sampling procedure, biased towards pessimism, is derived based on an alternating Markov game formulation of offline RL. We formally show that the biased sampling natur
Related papers
- Is Pessimism Provably Efficient For Offline RL? (2020)
- Revisiting Design Choices In Offline Model-based Reinforcement Learning (2021)
- Double Pessimism Is Provably Efficient For Distributionally Robust Offline Reinforcement Learning: Generic Algorithm And Robust Partial Coverage (2023)
- MOReL: Model-based Offline Reinforcement Learning (2020)
- DOMAIN: Mildly Conservative Model-based Offline Reinforcement Learning (2023)
- Pessimistic Bootstrapping For Uncertainty-driven Offline Reinforcement Learning (2022)
- State-aware Proximal Pessimistic Algorithms For Offline Reinforcement Learning (2022)
- Long-horizon Model-based Offline Reinforcement Learning Without Conservatism (2025)