DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction
2020 · Aviral Kumar, Abhishek Gupta, Sergey Levine
Abstract
Deep reinforcement learning can learn effective policies for a wide range of tasks, but is notoriously difficult to use due to instability and sensitivity to hyperparameters. The reasons for this remain unclear. When using standard supervised methods (e.g., for bandits), on-policy data collection provides "hard negatives" that correct the model in precisely those states and actions that the policy is likely to visit. We call this phenomenon "corrective feedback." We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from this corrective feedback, and training on the experience collected by the algorithm is not sufficient to correct errors in the Q-function. In fact, Q-learning and related methods can exhibit pathological interactions between the distribution of experience collected by the agent and the policy induced by training on that experience, leading to potential instability, sub-optimal convergence, and poor results when learning from noisy, sparse or
Related papers
- Off-policy Reinforcement Learning With Optimistic Exploration And Distribution Correction (2021) · 0.00
- Distributional Soft Actor-critic: Off-policy Reinforcement Learning For Addressing Value Estimation Errors (2020) · 17.77
- Stratified Experience Replay: Correcting Multiplicity Bias In Off-policy Reinforcement Learning (2021) · 0.00
- Conjugated Discrete Distributions For Distributional Reinforcement Learning (2021) · 0.00
- Mitigating Off-policy Bias In Actor-critic Methods With One-step Q-learning: A Novel Correction Approach (2022) · 0.00
- Off-policy Deep Reinforcement Learning Without Exploration (2018) · 0.00
- Q-distribution Guided Q-learning For Offline Reinforcement Learning: Uncertainty Penalized Q-value Via Consistency Model (2024) · 0.00
- The Distributional Reward Critic Framework For Reinforcement Learning Under Perturbed Rewards (2024) · 0.00