Concave Statistical Utility Maximization Bandits Via Influence-function Gradients
2026 · Matias Carrasco, Alejandro Cholaquidis
Abstract
arXiv:2604.22140v2
We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon problem reduces to optimizing over stationary mixed policies: each weight vector \(w\) on the simplex induces a mixture law \(P^w\), and performance is measured by the concave utility \(U(w)=\mathfrak U(P^w)\). For differentiable statistical utilities, we use influence-function calculus to derive stochastic gradient estimators from bandit feedback. This leads to an entropic mirror-ascent algorithm on a truncated simplex, implemented through multiplicative-weights updates and plug-in estimates of the influence function. We establish regret bounds that separate the mirror-ascent optimization error from the bias caused by estimating the influence function. The framework is developed for general concav
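The multiplicative-weights update described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm; every concrete choice is an assumption made for illustration: the utility \(\mathfrak U(P) = -(\mathbb E_P[X]-\tau)^2\) (concave in \(P\), with influence function \(-2(m-\tau)(x-m)\) at a law with mean \(m\)), unit-variance Gaussian rewards, the step size `eta`, and the parameter `gamma`. Mixing with the uniform distribution is used here as a simple stand-in for the truncated simplex, and the plug-in mean of \(P^w\) is built from per-arm empirical means.

```python
import numpy as np


def influence_target_mean(x, mean_est, target):
    """Influence function of U(P) = -(E_P[X] - target)^2 at a law with mean mean_est."""
    return -2.0 * (mean_est - target) * (x - mean_est)


def entropic_mirror_ascent(arm_means, target, horizon=5000, eta=0.05, gamma=0.05, seed=0):
    """Illustrative multiplicative-weights loop driven by bandit feedback.

    arm_means, eta, gamma and the Gaussian reward model are hypothetical
    choices for this sketch, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    arm_means = np.asarray(arm_means, dtype=float)
    K = len(arm_means)
    w = np.full(K, 1.0 / K)      # current mixed policy
    sums = np.zeros(K)           # per-arm reward sums
    counts = np.zeros(K)         # per-arm pull counts

    for _ in range(horizon):
        # Bandit feedback: pull one arm drawn from w, observe one reward.
        a = rng.choice(K, p=w)
        x = rng.normal(arm_means[a], 1.0)
        sums[a] += x
        counts[a] += 1

        # Plug-in estimate of the mean of the mixture law P^w.
        mu_hat = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        mean_est = float(w @ mu_hat)

        # Importance-weighted gradient estimate: only the pulled arm is nonzero.
        grad = np.zeros(K)
        grad[a] = influence_target_mean(x, mean_est, target) / w[a]

        # Entropic mirror ascent = multiplicative weights, then renormalize.
        logits = eta * grad
        w_tilde = w * np.exp(logits - logits.max())  # shift for numerical stability
        w_tilde /= w_tilde.sum()

        # Keep every weight at least gamma / K (stand-in for truncation).
        w = (1.0 - gamma) * w_tilde + gamma / K

    return w
```

Calling `entropic_mirror_ascent([0.0, 1.0, 2.0], target=1.5)` returns a weight vector on the simplex whose entries are all at least `gamma / K`. That lower bound keeps the importance weights `1 / w[a]` bounded, which is the practical reason for working on a truncated simplex in this sketch.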
Related papers
- Unified Framework Of Distributional Regret In Multi-armed Bandits And Reinforcement Learning (2026)
- Anti-concentrated Confidence Bonuses For Scalable Exploration (2021)
- Meta-learning Bandit Policies By Gradient Ascent (2020)
- Revealing Graph Bandits For Maximizing Local Influence (2026)
- Beyond Variance Reduction: Understanding The True Impact Of Baselines On Policy Optimization (2020)
- Trading Off Rewards And Errors In Multi-armed Bandits (2026)
- Stochastic Gradient Succeeds For Bandits (2024)
- Approximate Information Maximization For Bandit Games (2023)