Multi-action Restless Bandits With Weakly Coupled Constraints: Simultaneous Learning And Control
2024 · Jing Fu, Bill Moran, José Niño-Mora
Abstract
We study a system with finitely many groups of multi-action bandit processes, each of which is a Markov decision process (MDP) with finite state and action spaces and potentially different transition matrices under different actions. Bandit processes in the same group share the same state and action spaces and, given the same action, the same transition matrix. All bandit processes across the groups are subject to multiple weakly coupled constraints over their state and action variables. Unlike past studies, which focused on the offline case, we consider the online case without assuming full a priori knowledge of the transition matrices and reward functions, and we propose an effective scheme that enables simultaneous learning and control. We prove the convergence of the relevant processes in both time and the number of bandit processes, referred to as convergence in the time and the magnitude dimensions. Moreover, we prove that the relevant pr
Related papers
- Provably Efficient Reinforcement Learning For Adversarial Restless Multi-armed Bandits With Unknown Transitions And Bandit Feedback (2024) · 0.00
- A New Bandit Setting Balancing Information From State Evolution And Corrupted Context (2020) · 0.00
- Q-learning Lagrange Policies For Multi-action Restless Bandits (2021) · 8.35
- Learning In Restless Bandits Under Exogenous Global Markov Process (2021) · 6.34
- Online Markov Decision Processes With Aggregate Bandit Feedback (2021) · 0.00
- Online Learning For Cooperative Multi-player Multi-armed Bandits (2021) · 5.24
- Restless Bandit Problem With Rewards Generated By A Linear Gaussian Dynamical System (2024) · 0.00
- Bandit Social Learning: Exploration Under Myopic Behavior (2023) · 0.00