Inverse Contextual Bandits Without Rewards: Learning From A Non-stationary Learner Via Suffix Imitation
2026 · Yuqi Kong, Xiao Zhang, Weiran Shen
Abstract
We study the Inverse Contextual Bandit (ICB) problem, in which a learner seeks to optimize a policy while an observer, who cannot access the learner's rewards and only observes actions, aims to recover the underlying problem parameters. During the learning process, the learner's behavior naturally transitions from exploration to exploitation, resulting in non-stationary action data that poses significant challenges for the observer. To address this issue, we propose a simple and effective framework called Two-Phase Suffix Imitation. The framework discards data from an initial burn-in phase and performs empirical risk minimization using only data from a subsequent imitation phase. We derive a predictive decision loss bound that explicitly characterizes the bias-variance trade-off induced by the choice of burn-in length. Despite the severe information deficit, we show that a reward-free observer can achieve a convergence rate of \(\tilde O(1/\sqrt{N})\), matching the asymptotic efficie
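The two-phase idea described in the abstract (drop a burn-in prefix, then run empirical risk minimization on the remaining suffix of context-action pairs) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the learner is a hypothetical epsilon-greedy agent with per-arm ridge regression, the observer's policy class is softmax-linear trained with cross-entropy, and the names `simulate_learner`, `fit_suffix_policy`, the exploration schedule, and the burn-in length of 500 are all invented for this example.

```python
import numpy as np

def simulate_learner(theta, n_rounds, rng, noise=0.1):
    """Hypothetical non-stationary learner: epsilon-greedy with decaying
    exploration and per-arm ridge estimates. Returns only contexts and
    actions, since the observer never sees rewards."""
    K, d = theta.shape
    A = np.stack([np.eye(d)] * K)      # per-arm ridge Gram matrices
    b = np.zeros((K, d))               # per-arm reward-weighted context sums
    contexts = rng.normal(size=(n_rounds, d))
    actions = np.empty(n_rounds, dtype=int)
    for t, x in enumerate(contexts):
        eps = min(1.0, 100.0 / (t + 1))          # assumed exploration schedule
        if rng.random() < eps:
            a = int(rng.integers(K))             # explore
        else:
            theta_hat = np.linalg.solve(A, b[..., None])[..., 0]
            a = int(np.argmax(theta_hat @ x))    # exploit current estimates
        reward = theta[a] @ x + noise * rng.normal()
        A[a] += np.outer(x, x)
        b[a] += reward * x
        actions[t] = a
    return contexts, actions

def fit_suffix_policy(contexts, actions, burn_in, n_actions, lr=0.5, epochs=300):
    """Discard the first `burn_in` rounds, then minimize cross-entropy of a
    softmax-linear policy on the remaining (context, action) pairs."""
    X, y = contexts[burn_in:], actions[burn_in:]
    n, d = X.shape
    W = np.zeros((d, n_actions))
    Y = np.eye(n_actions)[y]                     # one-hot observed actions
    for _ in range(epochs):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (P - Y) / n              # gradient step on empirical risk
    return W

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 3))                  # assumed true per-arm parameters
contexts, actions = simulate_learner(theta, 3000, rng)
W = fit_suffix_policy(contexts, actions, burn_in=500, n_actions=3)

# Fraction of fresh contexts where the imitated policy picks the optimal arm.
test_x = rng.normal(size=(2000, 3))
agreement = np.mean(
    np.argmax(test_x @ W, axis=1) == np.argmax(test_x @ theta.T, axis=1)
)
```

Varying `burn_in` in this toy setup gives a feel for the trade-off the abstract mentions: a short burn-in keeps more exploratory (biased) actions in the training set, while a long one leaves fewer samples and so raises variance.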