Online Learning With Erd\h{o}s-r\'enyi Side-observation Graphs

Abstract

arXiv:2604.25271v1 Announce Type: cross Abstract: We consider adversarial multi-armed bandit problems where the learner is allowed to observe losses of a number of arms beside the arm that it actually chose. We study the case where all non-chosen arms reveal their loss with a fixed but unknown probability \(r\), independently of each other and the action of the learner. We propose two algorithms that work for different ranges of \(r\). We show that after \(T\) rounds in a bandit problem with \(N\) arms, the expected regret of our first algorithm is \(O(\sqrt\{(T /r) log N \})\) whenever \(r\ge(log T)/(2N)\), while our second algorithm achieves a regret of \(O(\sqrt\{(T/r) log (N+T)\})\) for smaller values of \(r\). We also give a quick estimation procedure that decides the range of~\(r\). All our bounds are within logarithmic factors of the best achievable performance of any algorithm that is even allowed to know~\(r\).

Abstract

Related papers