RUDDER: Return Decomposition For Delayed Rewards
2018 · Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, et al.
Abstract
We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed reward
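The reward redistribution idea in (i) and (ii) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the redistributed reward at each step is the difference between consecutive predictions of the episode return, and it uses a hand-written stand-in (`toy_return_predictor`, hypothetical) where the abstract describes a learned regression model. The toy task, the action encoding, and all function names are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence


def toy_return_predictor(prefix: Sequence[int]) -> float:
    """Hypothetical stand-in for a learned return predictor.

    Toy task (an assumption): the episode pays 1.0 at the final step
    if action 1 was taken at any point, otherwise 0.0.
    """
    return 1.0 if 1 in prefix else 0.0


def redistribute(
    predictor: Callable[[Sequence[int]], float],
    actions: Sequence[int],
    episode_return: float,
) -> List[float]:
    """Spread a delayed return over the steps that caused it.

    The reward at step t is the change in predicted return when step t
    is appended to the prefix (the prediction before step 0 is taken as 0).
    """
    preds = [predictor(actions[: t + 1]) for t in range(len(actions))]
    rewards = [preds[0]] + [preds[t] - preds[t - 1] for t in range(1, len(preds))]
    # Put any prediction error on the last step so the redistributed
    # rewards still sum to the original return (return equivalence).
    rewards[-1] += episode_return - preds[-1]
    return rewards


actions = [0, 1, 0, 0]                   # action 1 is the decisive one
original_rewards = [0.0, 0.0, 0.0, 1.0]  # reward arrives only at the end
redistributed = redistribute(toy_return_predictor, actions, sum(original_rewards))
print(redistributed)
```

In this toy episode the whole return moves to the step where the decisive action occurs, and the sum of the redistributed rewards equals the original return. After that step the remaining rewards are zero, which is the sense in which the expected future reward is pushed toward zero.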
Related papers
- Learning Long-term Reward Redistribution Via Randomized Return Decomposition (2021)
- Reinforcement Learning With Delayed, Composite, And Partially Anonymous Reward (2023)
- Interpretable Reward Redistribution In Reinforcement Learning: A Causal Approach (2023)
- Burning RED: Unlocking Subtask-driven Reinforcement Learning And Risk-awareness In Average-reward Markov Decision Processes (2024)
- Episodic Return Decomposition By Difference Of Implicitly Assigned Sub-trajectory Reward (2023)
- Revisiting State Augmentation Methods For Reinforcement Learning With Stochastic Delays (2021)
- Regret-optimal Model-free Reinforcement Learning For Discounted MDPs With Short Burn-in Time (2023)
- Reinforcement Learning For Infinite-horizon Average-reward Linear MDPs Via Approximation By Discounted-reward MDPs (2024)