Multi-step Reinforcement Learning: A Unifying Algorithm
2017 · Kristopher de Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, et al.
Abstract
Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(\(\lambda\)) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter \(\lambda\). Currently, there are a multitude of algorithms that can be used to perform TD control, including Sarsa, \(Q\)-learning, and Expected Sarsa. These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance. Each of these algorithms is seemingly distinct, and no one dominates the others for all problems. In this paper, we study a new multi-step action-value algorithm called \(Q(\sigma)\) which unifies and generalizes these existing algorithms, while subsuming them as special cases. A new parameter, \(\sigma\), is introduced to allow the degree of sampling performed by the algorithm at each step d
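To make the role of \(\sigma\) concrete, here is a minimal sketch of a one-step backup target in the form commonly given for \(Q(\sigma)\): \(R + \gamma\,[\sigma\, Q(S', A') + (1 - \sigma) \sum_a \pi(a \mid S')\, Q(S', a)]\). The abstract does not spell out the update, so this formula is an assumption and may differ from the authors' full multi-step algorithm. The function name `q_sigma_target` and all of its arguments are hypothetical, chosen only for this example.

```python
from typing import Sequence


def q_sigma_target(
    reward: float,
    gamma: float,
    q_next: Sequence[float],   # Q(S', a) for every action a (assumed tabular)
    pi_next: Sequence[float],  # target-policy probabilities pi(a | S')
    next_action: int,          # the action A' actually sampled in S'
    sigma: float,              # degree of sampling, between 0 and 1
) -> float:
    """One-step backup target mixing a sampled and an expected next value."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    # Fully sampled component: the value of the action that was taken.
    sampled = q_next[next_action]
    # Fully expected component: the average over the target policy.
    expected = sum(p * q for p, q in zip(pi_next, q_next))
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)


# Illustrative numbers only (not taken from the paper).
q_next = [1.0, 3.0]
uniform = [0.5, 0.5]
greedy = [0.0, 1.0]

sarsa_like = q_sigma_target(1.0, 0.5, q_next, uniform, 1, sigma=1.0)
expected_sarsa_like = q_sigma_target(1.0, 0.5, q_next, uniform, 1, sigma=0.0)
q_learning_like = q_sigma_target(1.0, 0.5, q_next, greedy, 0, sigma=0.0)
mixed = q_sigma_target(1.0, 0.5, q_next, uniform, 1, sigma=0.5)
```

Under these assumptions, `sigma=1.0` gives a Sarsa-style target, `sigma=0.0` gives an Expected Sarsa-style target, and `sigma=0.0` with a greedy target policy reduces to a Q-learning-style target. Intermediate values blend the sampled and expected components.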
Related papers
- Double Q(\(\sigma\)) And Q(\(\sigma, \lambda\)): Unifying Reinforcement Learning Control Algorithms (2017)
- A Unified Approach For Multi-step Temporal-difference Learning With Eligibility Traces In Reinforcement Learning (2018)
- Understanding Multi-step Deep Reinforcement Learning: A Systematic Study Of The DQN Target (2019)
- A Distributional Analysis Of Sampling-based Reinforcement Learning Algorithms (2020)
- Time-scale Separation In Q-learning: Extending TD(\(\Delta\)) For Action-value Function Decomposition (2024)
- Iterated \(Q\)-network: Beyond One-step Bellman Updates In Deep Reinforcement Learning (2024)
- Approximating Two Value Functions Instead Of One: Towards Characterizing A New Family Of Deep Reinforcement Learning Algorithms (2019)
- TBQ(\(\sigma\)): Improving Efficiency Of Trace Utilization For Off-policy Reinforcement Learning (2019)