First-order Regret In Reinforcement Learning With Linear Function Approximation: A Robust Estimation Approach
2021 Β· Andrew Wagenmaker, Yifang Chen, Max Simchowitz, et al.
Abstract
Obtaining first-order regret bounds -- regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap, and show that it is possible to obtain regret scaling as \(\widetilde\{\mathcal\{O\}\}(\sqrt\{d^3 H^3 \cdot V_1^\star \cdot K\} + d^\{3.5\}H^3log K )\) in reinforcement learning with large state spaces, namely the linear MDP setting. Here \(V_1^\star\) is the value of the optimal policy and \(K\) is the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
Authors
(none)
Tags
Stats
Related papers
- Improved Regret For Efficient Online Reinforcement Learning With Linear Function Approximation (2023)0.00
- Refined Regret For Adversarial Mdps With Linear Function Approximation (2023)0.00
- Nearly Minimax Optimal Reinforcement Learning For Linear Markov Decision Processes (2022)0.00
- Value Function Approximations Via Kernel Embeddings For No-regret Reinforcement Learning (2020)0.00
- Towards Optimal Regret In Adversarial Linear Mdps With Bandit Feedback (2023)0.00
- Prior-dependent Analysis Of Posterior Sampling Reinforcement Learning With Function Approximation (2024)0.00
- Foundations Of Safe Online Reinforcement Learning In The Linear Quadratic Regulator: \(\sqrt{t}\)-regret (2025)0.00
- Logarithmic Regret For Episodic Continuous-time Linear-quadratic Reinforcement Learning Over A Finite-time Horizon (2020)7.81