Complete Policy Regret Bounds For Tallying Bandits
2022 Β· Dhruv Malik, Yuanzhi Li, Aarti Singh
Abstract
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the *complete policy regret*, which is the strongest possible version of policy regret. We identify a gap in the current theoretical understanding of what sorts of restrictions permit tractability in this challenging setting. To resolve this gap, we consider a generalization of the stochastic multi armed bandit, which we call the *tallying bandit*. This is an online learning setting with an \(m\)-memory bounded adversary, where the average loss for playing an action is an unknown function of the number (or tally) of times that the action was played in the last \(m\) timesteps. For tallying bandit problems with \(K\) actions and time horizon \(T\), we provide an algorithm that w.h.p achieves a complete policy regret guarantee of \(\tilde\{\mathcal\{O\}\}(mK\sqrt\{T\})\), where t
Authors
(none)
Tags
Stats
Related papers
- Near-optimal Regret Using Policy Optimization In Online Mdps With Aggregate Bandit Feedback (2025)0.00
- Towards Optimal Regret In Adversarial Linear Mdps With Bandit Feedback (2023)0.00
- Online Markov Decision Processes With Aggregate Bandit Feedback (2021)0.00
- Best Of Both Worlds: Regret Minimization Versus Minimax Play (2025)0.00
- Adversarial Learning In Games With Bandit Feedback: Logarithmic Pure-strategy Maximin Regret (2026)0.00
- Rate-optimal Policy Optimization For Linear Markov Decision Processes (2023)0.00
- On-line Learning In Tree Mdps By Treating Policies As Bandit Arms (2026)0.00
- Learning In Markov Games With Adaptive Adversaries: Policy Regret, Fundamental Barriers, And Efficient Algorithms (2024)0.00