Improved Algorithm For Adversarial Linear Mixture MDPs With Bandit Feedback And Unknown Transition
2024 · Long-Fei Li, Peng Zhao, Zhi-Hua Zhou
Abstract
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs whose transition kernel is a linear mixture model. We propose a new algorithm that attains an \(\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})\) regret with high probability, where \(d\) is the dimension of feature mappings, \(S\) is the size of state space, \(A\) is the size of action space, \(H\) is the episode length and \(K\) is the number of episodes. Our result strictly improves the previous best-known \(\widetilde{O}(dS^2 \sqrt{K} + \sqrt{HSAK})\) result in Zhao et al. (2023a) since \(H \leq S\) holds by the layered MDP structure. Our advancements are primarily attributed to (i) a new least square estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration tailored sp
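The comparison with the earlier bound can be checked directly. A short derivation, using only the quantities defined in the abstract and the stated condition \(H \leq S\):

\[
d\sqrt{HS^3K} \;\leq\; d\sqrt{S \cdot S^3 K} \;=\; dS^2\sqrt{K},
\qquad
\frac{dS^2\sqrt{K}}{d\sqrt{HS^3K}} \;=\; \sqrt{\frac{S}{H}} \;\geq\; 1 .
\]

The second term, \(\sqrt{HSAK}\), is identical in both bounds, so the new bound is never larger, and its first term is smaller by a factor of \(\sqrt{S/H}\) whenever \(H < S\).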
Related papers
- Towards Optimal Regret In Adversarial Linear MDPs With Bandit Feedback (2023)
- Near-optimal Policy Optimization Algorithms For Learning Adversarial Linear Mixture MDPs (2021)
- Online Learning In MDPs With Partially Adversarial Transitions And Losses (2026)
- Online Learning In MDPs With Linear Function Approximation And Bandit Feedback (2020)
- Near-optimal Regret For Adversarial MDP With Delayed Bandit Feedback (2022)
- Refined Regret For Adversarial MDPs With Linear Function Approximation (2023)
- Efficient Policy Learning For Non-stationary MDPs Under Adversarial Manipulation (2019)
- Learning Adversarial Markov Decision Processes With Delayed Feedback (2020)