Long-horizon Model-based Offline Reinforcement Learning Without Conservatism
2025 · Tianwei Ni, Esther Derman, Vineet Jain, et al.
Abstract
Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting rollout horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale this principle to realistic tasks and show that long-horizon planning is critical for reducing value overestimation once conservatism is removed. To make this feasible, we introduce key design choices for performing and learning from long-horizon rollouts while controlling compounding errors. These yield our algorit
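The contrast drawn in the abstract between conservatism and the Bayesian approach can be illustrated with a toy two-armed Bernoulli bandit. The sketch below is not the paper's algorithm. The dataset, the true arm means, the Beta(1, 1) prior, and the penalty form are all hypothetical, and Thompson sampling is swapped in as a simple stand-in for a history-dependent agent trained against the posterior. It only shows how a count-based penalty can lock onto a well-covered but poor arm, while a posterior-based agent keeps adapting at test time.

```python
import math
import random


def beta_posterior(data, n_arms, prior=(1.0, 1.0)):
    """Per-arm Beta(alpha, beta) parameters from offline (arm, reward) pairs."""
    post = [[prior[0], prior[1]] for _ in range(n_arms)]
    for arm, reward in data:
        post[arm][0] += reward        # count successes
        post[arm][1] += 1 - reward    # count failures
    return post


def conservative_arm(post, penalty=1.0):
    """Pick the arm maximizing posterior mean minus a count-based penalty.

    The penalty form is an illustrative assumption; n includes the
    prior pseudo-counts.
    """
    def score(ab):
        a, b = ab
        n = a + b
        return a / n - penalty / math.sqrt(n)
    return max(range(len(post)), key=lambda i: score(post[i]))


def bayesian_rollout(post, true_means, steps, rng):
    """Thompson sampling at test time (a stand-in, not the paper's agent).

    Each step samples one plausible model from the posterior, acts greedily
    under it, and updates the posterior with the observed reward, so the
    behavior depends on the test-time history.
    """
    post = [list(ab) for ab in post]  # copy: leave the offline posterior intact
    total = 0
    for _ in range(steps):
        samples = [rng.betavariate(a, b) for a, b in post]
        arm = max(range(len(post)), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        post[arm][0] += reward
        post[arm][1] += 1 - reward
        total += reward
    return total / steps


# Hypothetical low-quality dataset: the poor arm 0 is well covered,
# the better arm 1 was tried only twice.
offline_data = [(0, 1)] * 15 + [(0, 0)] * 35 + [(1, 1), (1, 0)]
posterior = beta_posterior(offline_data, n_arms=2)

if __name__ == "__main__":
    true_means = [0.3, 0.8]  # hypothetical ground truth, unknown to both agents
    print("conservative choice:", conservative_arm(posterior))
    avg = bayesian_rollout(posterior, true_means, 200, random.Random(0))
    print("Bayesian average test-time reward:", avg)
```

With this dataset the penalized score favors arm 0 because arm 1 has almost no coverage, so the conservative agent commits to it. The posterior-based agent still assigns arm 1 a wide range of plausible means and can discover its value through test-time interaction.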
Related papers
- DOMAIN: Mildly Conservative Model-based Offline Reinforcement Learning (2023) · 0.00
- Plan Better Amid Conservatism: Offline Multi-agent Reinforcement Learning With Actor Rectification (2021) · 0.00
- Conservative Bayesian Model-based Value Expansion For Offline Policy Optimization (2022) · 0.00
- Compositional Conservatism: A Transductive Approach In Offline Reinforcement Learning (2024) · 1.81
- Mildly Conservative Q-learning For Offline Reinforcement Learning (2022) · 0.00
- Model-based Offline Reinforcement Learning With Pessimism-modulated Dynamics Belief (2022) · 0.00
- MICRO: Model-based Offline Reinforcement Learning With A Conservative Bellman Operator (2023) · 0.00
- Revisiting Design Choices In Offline Model-based Reinforcement Learning (2021) · 6.34