Learning In Restless Bandits Under Exogenous Global Markov Process
2021 · Tomer Gafni, Michal Yemini, Kobi Cohen
Abstract
We consider an extension to the restless multi-armed bandit (RMAB) problem with unknown arm dynamics, where an unknown exogenous global Markov process governs the rewards distribution of each arm. Under each global state, the rewards process of each arm evolves according to an unknown Markovian rule, which is non-identical among different arms. At each time, a player chooses an arm out of \(N\) arms to play, and receives a random reward from a finite set of reward states. The arms are restless, that is, their local state evolves regardless of the player's actions. Motivated by recent studies on related RMAB settings, the regret is defined as the reward loss with respect to a player that knows the dynamics of the problem, and plays at each time \(t\) the arm that maximizes the expected immediate value. The objective is to develop an arm-selection policy that minimizes the regret. To that end, we develop the Learning under Exogenous Markov Process (LEMP) algorithm. We analyze LEMP theore
Related papers
- Provably Efficient Reinforcement Learning For Adversarial Restless Multi-armed Bandits With Unknown Transitions And Bandit Feedback (2024)
- Restless Bandit Problem With Rewards Generated By A Linear Gaussian Dynamical System (2024)
- Towards A Pretrained Model For Restless Bandits Via Multi-arm Generalization (2023)
- Multi-action Restless Bandits With Weakly Coupled Constraints: Simultaneous Learning And Control (2024)
- Q-learning Lagrange Policies For Multi-action Restless Bandits (2021)
- Online Markov Decision Processes With Aggregate Bandit Feedback (2021)
- Online Learning For Cooperative Multi-player Multi-armed Bandits (2021)
- Non-stationary Latent Auto-regressive Bandits (2024)