DROP: Distributional And Regular Optimism And Pessimism For Reinforcement Learning
2024 · Taisuke Kobayashi
Abstract
In reinforcement learning (RL), the temporal difference (TD) error is known to be related to the firing rate of dopamine neurons. It has been observed that dopamine neurons do not behave uniformly; instead, each responds to the TD error in an optimistic or pessimistic manner, which has been interpreted as a kind of distributional RL. To explain such biological data, a heuristic model has also been introduced with asymmetric learning rates for positive and negative TD errors. However, this heuristic model is not theoretically grounded, and it is unknown whether it can work as an RL algorithm. This paper therefore introduces a novel theoretically grounded model with optimism and pessimism, which is derived from control as inference. In combination with ensemble learning, a distributional value function as a critic is estimated from regularly introduced optimism and pessimism. Based on its central value, a policy in an actor is improved. This proposed algorithm, so-called DROP (distributional and regular opt
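The heuristic model mentioned in the abstract, with separate learning rates for positive and negative TD errors, can be illustrated with a minimal sketch. This is not the DROP algorithm itself, whose derivation is not given on this page; it only shows how asymmetric learning rates make otherwise identical value estimators optimistic or pessimistic. The function names, the learning-rate values, and the Bernoulli reward are assumptions chosen for illustration.

```python
import random


def asymmetric_td_update(value, reward, next_value, alpha_pos, alpha_neg, gamma=0.99):
    """One TD(0) update with separate learning rates by TD-error sign.

    alpha_pos is applied when the TD error is positive, alpha_neg otherwise.
    Both names are hypothetical, not taken from the paper.
    """
    td_error = reward + gamma * next_value - value
    alpha = alpha_pos if td_error > 0 else alpha_neg
    return value + alpha * td_error


def estimate_values(rewards, learning_rate_pairs):
    """Train one scalar value estimate per (alpha_pos, alpha_neg) pair.

    Every estimator sees the same reward sequence; the episode ends after
    each reward, so the bootstrap target next_value is always 0.
    """
    values = [0.0] * len(learning_rate_pairs)
    for reward in rewards:
        for i, (alpha_pos, alpha_neg) in enumerate(learning_rate_pairs):
            values[i] = asymmetric_td_update(values[i], reward, 0.0, alpha_pos, alpha_neg)
    return values


# Assumed toy setting: reward is 0 or 1 with equal probability.
rng = random.Random(0)
rewards = [rng.choice([0.0, 1.0]) for _ in range(5000)]

# (alpha_pos, alpha_neg): pessimistic, neutral, optimistic estimators.
pairs = [(0.02, 0.08), (0.05, 0.05), (0.08, 0.02)]
pessimistic, neutral, optimistic = estimate_values(rewards, pairs)
print(pessimistic, neutral, optimistic)
```

In this toy setting the estimator with the larger positive-error learning rate settles above the neutral one, and the estimator with the larger negative-error learning rate settles below it, so a set of such estimators spans a range of values rather than a single mean.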
Related papers
- Pseudo-quantized Actor-critic Algorithm For Robustness To Noisy Temporal Difference Error (2026)
- Model-based Offline Reinforcement Learning With Pessimism-modulated Dynamics Belief (2022)
- Pitfall Of Optimism: Distributional Reinforcement Learning By Randomizing Risk Criterion (2023)
- Temporal-difference Learning Using Distributed Error Signals (2024)
- Learning Sparse Representations In Reinforcement Learning (2019)
- Reinforcement Learning With Human Feedback: Learning Dynamic Choices Via Pessimism (2023)
- Discerning Temporal Difference Learning (2023)
- Moments Matter: Stabilizing Policy Optimization Using Return Distributions (2026)