ODICE: Revealing The Mystery Of Distribution Correction Estimation Via Orthogonal-gradient Update
2024 · Liyuan Mao, Haoran Xu, Weinan Zhang, et al.
Abstract
In this study, we investigate DIstribution Correction Estimation (DICE) methods, an important line of work in offline reinforcement learning (RL) and imitation learning (IL). DICE-based methods impose a state-action-level behavior constraint, which is an ideal choice for offline learning. However, they typically perform much worse than current state-of-the-art (SOTA) methods that solely use an action-level behavior constraint. After revisiting DICE-based methods, we find that two gradient terms arise when learning the value function with a true-gradient update: the forward gradient (taken on the current state) and the backward gradient (taken on the next state). Using the forward gradient bears a large similarity to many offline RL methods, and thus can be regarded as applying an action-level constraint. However, directly adding the backward gradient may degenerate or cancel out the forward gradient's effect if these two gradients have conflicting directions. To resolve this issue, we propose a simple yet effective modi
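As a rough illustration of the two gradient terms the abstract describes, the sketch below uses a linear value function and a squared Bellman residual. Both are simplifying assumptions chosen for readability, not the paper's objective. The helper names `td_gradients` and `orthogonal_part` are hypothetical. `orthogonal_part` shows one plausible reading of the "orthogonal-gradient update" named in the title (dropping the component of the backward gradient that is parallel to the forward gradient); the abstract text above does not spell out the actual modification.

```python
import numpy as np

def td_gradients(w, phi_s, phi_next, reward, gamma=0.99):
    """Split the true gradient of 0.5 * delta**2 into its two terms.

    Assumes a linear value function V(s) = w @ phi(s) (illustrative only).
    """
    delta = reward + gamma * (w @ phi_next) - w @ phi_s
    forward = -delta * phi_s             # term that flows through V(s)
    backward = delta * gamma * phi_next  # term that flows through V(s')
    return forward, backward

def orthogonal_part(backward, forward, eps=1e-12):
    """Remove from `backward` its component parallel to `forward`.

    Hypothetical helper: one possible way to keep the two terms from
    cancelling each other, not confirmed by the abstract.
    """
    coef = (backward @ forward) / (forward @ forward + eps)
    return backward - coef * forward

# Made-up weights and features for a single transition.
w = np.array([0.5, -0.2, 0.1])
phi_s = np.array([1.0, 0.0, 1.0])
phi_next = np.array([0.9, 0.1, 1.0])

forward, backward = td_gradients(w, phi_s, phi_next, reward=1.0)
conflict = float(forward @ backward) < 0.0   # negative dot product: opposing directions
projected = orthogonal_part(backward, forward)
```

With similar features for the current and next state, the two terms point in nearly opposite directions, which is the conflict the abstract warns about. After the projection, `projected @ forward` is zero up to floating-point error, so adding `projected` to `forward` no longer shrinks the forward term.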
Related papers
- Diffusion-DICE: In-sample Diffusion Guidance For Offline Reinforcement Learning (2024)
- OptiDICE: Offline Policy Optimization Via Stationary Distribution Correction Estimation (2021)
- GradientDICE: Rethinking Generalized Offline Estimation Of Stationary Values (2020)
- DualDICE: Behavior-agnostic Estimation Of Discounted Stationary Distribution Corrections (2019)
- AlberDICE: Addressing Out-of-distribution Joint Actions In Offline Multi-agent RL Via Alternating Stationary Distribution Correction Estimation (2023)
- COptiDICE: Offline Constrained Reinforcement Learning Via Stationary Distribution Correction Estimation (2022)
- LobsDICE: Offline Learning From Observation Via Stationary Distribution Correction Estimation (2022)
- Off-policy Reinforcement Learning With Optimistic Exploration And Distribution Correction (2021)