Improved Regret Bound And Experience Replay In Regularized Policy Iteration
2021 Β· Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, et al.
Abstract
In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from \(O(T^\{3/4\})\) to \(O(\sqrt\{T\})\) under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability \(O(\sqrt\{T\})\) regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices,
Authors
(none)
Tags
Stats
Related papers
- Refined Regret For Adversarial Mdps With Linear Function Approximation (2023)0.00
- Exploration-enhanced POLITEX (2019)0.00
- Fast Rates For The Regret Of Offline Reinforcement Learning (2021)2.26
- Adaptive Approximate Policy Iteration (2020)0.00
- Logarithmic Regret Of Exploration In Average Reward Markov Decision Processes (2025)0.00
- Reinforcement Learning For Infinite-horizon Average-reward Linear Mdps Via Approximation By Discounted-reward Mdps (2024)0.00
- Square-root Regret Bounds For Continuous-time Episodic Markov Decision Processes (2022)2.26
- Dynamic Regret Of Online Markov Decision Processes (2022)0.00