Q-distribution Guided Q-learning For Offline Reinforcement Learning: Uncertainty Penalized Q-value Via Consistency Model
2024 · Jing Zhang, Linjiajie Fang, Kexin Shi, et al.
Abstract
"Distribution shift" is the main obstacle to the success of offline reinforcement learning. A learning policy may take actions beyond the behavior policy's knowledge, referred to as Out-of-Distribution (OOD) actions. The Q-values for these OOD actions can be easily overestimated. As a result, the learning policy is biased by using incorrect Q-value estimates. One common approach to avoid Q-value overestimation is to make a pessimistic adjustment. Our key idea is to penalize the Q-values of OOD actions associated with high uncertainty. In this work, we propose Q-Distribution Guided Q-Learning (QDQ), which applies a pessimistic adjustment to Q-values in OOD regions based on uncertainty estimation. This uncertainty measure relies on the conditional Q-value distribution, learned through a high-fidelity and efficient consistency model. Additionally, to prevent overly conservative estimates, we introduce an uncertainty-aware optimization objective for updating the Q-value function. The pro
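The general idea of an uncertainty-penalized Q-value can be illustrated with a minimal sketch. This is not the QDQ algorithm from the paper: it assumes that samples of the Q-value for a given state-action pair are already available (in the paper they would come from a learned conditional Q-value distribution), and it uses the sample standard deviation as a stand-in uncertainty measure. The function name `penalized_q` and the parameters `beta` and `threshold` are hypothetical, introduced only for illustration.

```python
import random
import statistics


def penalized_q(q_samples, beta=1.0, threshold=0.5):
    """Return a pessimistic Q-value for one (state, action) pair.

    q_samples: hypothetical draws from a Q-value distribution (assumed given).
    beta:      hypothetical weight on the uncertainty penalty.
    threshold: hypothetical uncertainty level above which the action is
               treated as out-of-distribution (OOD) and penalized.
    """
    mean_q = statistics.fmean(q_samples)
    uncertainty = statistics.pstdev(q_samples)  # spread of the sampled Q-values
    if uncertainty <= threshold:
        # Low uncertainty: leave the estimate unchanged to avoid
        # being overly conservative on in-distribution actions.
        return mean_q
    # High uncertainty: subtract a penalty proportional to the spread.
    return mean_q - beta * uncertainty


if __name__ == "__main__":
    rng = random.Random(0)
    in_dist = [rng.gauss(10.0, 0.1) for _ in range(200)]  # tight samples
    ood = [rng.gauss(10.0, 3.0) for _ in range(200)]      # widely spread samples
    print("in-distribution:", penalized_q(in_dist))
    print("OOD:", penalized_q(ood))
```

With the same mean, the widely spread samples yield a lower adjusted value than the tight ones, which is the qualitative effect a pessimistic adjustment is meant to have.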
Related papers
- Diverse Randomized Value Functions: A Provably Pessimistic Approach For Offline Reinforcement Learning (2024), 3.58
- Uncertainty-based Offline Reinforcement Learning With Diversified Q-ensemble (2021), 0.00
- Mildly Conservative Q-learning For Offline Reinforcement Learning (2022), 0.00
- Mitigating Distribution Shift In Model-based Offline RL Via Shifts-aware Reward Learning (2024), 0.00
- Confidence-conditioned Value Functions For Offline Reinforcement Learning (2022), 0.00
- Q-learning With Shift-aware Upper Confidence Bound In Non-stationary Reinforcement Learning (2025), 0.00
- A Perspective Of Q-value Estimation On Offline-to-online Reinforcement Learning (2023), 7.81
- Pessimistic Q-learning For Offline Reinforcement Learning: Towards Optimal Sample Complexity (2022), 0.00