On-line Learning In Tree Mdps By Treating Policies As Bandit Arms
2026 Β· Anvay Shah, Ramsundar Anandanarayanan, Sharayu Moharir, et al.
Abstract
arXiv:2605.04979v1 Announce Type: cross Abstract: A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state \(s_\{1\}\), in which every state is reachable from \(s_\{1\}\) through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms -- \textsc\{Lucb\} and \textsc\{Ucb\} -- can be applied on T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can yet be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample comp
Authors
(none)
Tags
Stats
Related papers
- Online Markov Decision Processes With Aggregate Bandit Feedback (2021)0.00
- Efficient Policy Learning For Non-stationary Mdps Under Adversarial Manipulation (2019)0.00
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)3.58
- Multi-action Restless Bandits With Weakly Coupled Constraints: Simultaneous Learning And Control (2024)0.00
- Online Learning In Mdps With Linear Function Approximation And Bandit Feedback (2020)0.00
- Complete Policy Regret Bounds For Tallying Bandits (2022)0.00
- Near-optimal Regret Using Policy Optimization In Online Mdps With Aggregate Bandit Feedback (2025)0.00
- Decision Making In Non-stationary Environments With Policy-augmented Monte Carlo Tree Search (2022)0.00