Refined Regret For Adversarial Mdps With Linear Function Approximation
2023 Β· Yan Dai, Haipeng Luo, Chen-Yu Wei, et al.
Abstract
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over \(K\) episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021) is of order \(\tilde\{\mathcal O\}(K^\{2/3\})\) (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to \(\tilde\{\mathcal O\}(\sqrt K)\) in the same setting. Our first algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the log-barrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions in the f
Authors
(none)
Tags
Stats
Related papers
- Towards Optimal Regret In Adversarial Linear Mdps With Bandit Feedback (2023)0.00
- Improved Regret For Efficient Online Reinforcement Learning With Linear Function Approximation (2023)0.00
- First-order Regret In Reinforcement Learning With Linear Function Approximation: A Robust Estimation Approach (2021)0.00
- Dynamic Regret Of Online Markov Decision Processes (2022)0.00
- Improved Regret Bound And Experience Replay In Regularized Policy Iteration (2021)0.00
- Nearly Minimax Optimal Reinforcement Learning For Linear Markov Decision Processes (2022)0.00
- Online Learning In Mdps With Linear Function Approximation And Bandit Feedback (2020)0.00
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)3.58