Double Horizon Model-based Policy Optimization
2025 Β· Akihiro Kubo, Paavo Parmas, Shin Ishii
Abstract
Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the length of the rollouts poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) Moreover, a longer model rollout may reduce value estimation bias but raise the variance of policy gradients due to backpropagation through multiple steps, implying another intermediate horizon for stable gradient estimates. However, these two optimal horizons may differ. To resolve this conflict, we propose Double Horizon Model-Based Policy Optimization (DHMBPO), which divides the rollout procedure into a long "distribution rollout" (DR) and a short "training rollout" (TR). The DR generates on-policy state samples for mitigating
Authors
(none)
Tags
Stats
Related papers
- When To Trust Your Model: Model-based Policy Optimization (2019)0.00
- Conservative Dual Policy Optimization For Efficient Model-based Reinforcement Learning (2022)0.00
- How To Fine-tune The Model: Unified Model Shift And Model Bias Policy Optimization (2023)0.00
- Model-based Multi-agent Policy Optimization With Adaptive Opponent-wise Rollouts (2021)0.00
- Towards Causal Model-based Policy Optimization (2025)0.00
- Live In The Moment: Learning Dynamics Model Adapted To Evolving Policy (2022)0.00
- Model-based Offline Reinforcement Learning With Pessimism-modulated Dynamics Belief (2022)0.00
- Bayes-adaptive Deep Model-based Policy Optimisation (2020)0.00