Off-policy Evaluation And Learning From Logged Bandit Feedback: Error Reduction Via Surrogate Policy
2018 · Yuan Xie, Boyi Liu, Qiang Liu, et al.
Abstract
When learning from a batch of logged bandit feedback, the discrepancy between the policy to be learned and the off-policy training data imposes statistical and computational challenges. Unlike classical supervised learning and online learning settings, in batch contextual bandit learning, one only has access to a collection of logged feedback from the actions taken by a historical policy, and expects to learn a policy that takes good actions in possibly unseen contexts. Such a batch learning setting is ubiquitous in online and interactive systems, such as ad platforms and recommendation systems. Existing approaches based on inverse propensity weights, such as Inverse Propensity Scoring (IPS) and Policy Optimizer for Exponential Models (POEM), enjoy unbiasedness but often suffer from large mean squared error. In this work, we introduce a new approach named Maximum Likelihood Inverse Propensity Scoring (MLIPS) for batch learning from logged bandit feedback. Instead of using the given hist
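For readers unfamiliar with the IPS baseline named in the abstract, the following is a minimal sketch of the standard inverse propensity scoring estimator of a target policy's value. It is not the MLIPS method the paper introduces. The names `ips_estimate`, `toy_logs`, `greedy_policy`, and the toy reward rule are hypothetical and chosen only for illustration.

```python
def ips_estimate(logs, target_policy):
    """Standard IPS estimate of a target policy's expected reward.

    logs: list of (context, action, reward, logging_propensity) tuples,
          where logging_propensity is the probability with which the
          historical policy chose `action` in `context`.
    target_policy: function (context, action) -> probability that the
          policy being evaluated would choose `action` in `context`.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy(context, action) / propensity
        total += weight * reward
    return total / len(logs)


# Toy log (assumption): two contexts, two actions, a uniform logging
# policy (propensity 0.5), and reward 1.0 when the action equals the context.
toy_logs = [
    (0, 0, 1.0, 0.5),
    (0, 1, 0.0, 0.5),
    (1, 1, 1.0, 0.5),
    (1, 0, 0.0, 0.5),
]


def greedy_policy(context, action):
    # Deterministic target policy that always picks action == context.
    return 1.0 if action == context else 0.0


print(ips_estimate(toy_logs, greedy_policy))  # 1.0 on this toy log
```

Because each reward is divided by a logging propensity, small propensities produce large weights, which is one way to see why such estimators can be unbiased yet have large mean squared error.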
Related papers
- Log-sum-exponential Estimator For Off-policy Evaluation And Learning (2025)
- Logarithmic Smoothing For Pessimistic Off-policy Evaluation, Selection And Learning (2024)
- Anytime-valid Off-policy Inference For Contextual Bandits (2022)
- Doubly Robust Interval Estimation For Optimal Policy Evaluation In Online Learning (2021)
- DOLCE: Decomposing Off-policy Evaluation/learning Into Lagged And Current Effects (2025)
- Beyond Variance Reduction: Understanding The True Impact Of Baselines On Policy Optimization (2020)
- Near-optimal Regret Using Policy Optimization In Online MDPs With Aggregate Bandit Feedback (2025)
- Bandit Social Learning: Exploration Under Myopic Behavior (2023)