Log-sum-exponential Estimator For Off-policy Evaluation And Learning
2025 · Armin Behnamnia, Gholamali Aminian, Alireza Aghaei, et al.
Abstract
Off-policy learning and evaluation leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator's bias and variance. In the off-policy learning scenario, we establish bounds on the regret -- the performance gap between our LSE estimator and the optimal policy -- assuming bounded \((1+\epsilon)\)-th moment of weighted reward. Notably, we achieve a convergence rate of \(O(n^{-\epsilon/(1+\epsilon)})\) for the regret bounds, where \(\ep
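To make the contrast with inverse propensity scoring concrete, here is a minimal Python sketch. The abstract does not state the estimator's formula, so the form used below, \(\frac{1}{\lambda}\log\big(\frac{1}{n}\sum_i e^{\lambda w_i r_i}\big)\) with \(\lambda < 0\) and importance weight \(w_i\) equal to the target propensity divided by the logging propensity, is an assumption for illustration. The function names, the parameter `lam`, and the toy data are likewise hypothetical.

```python
import numpy as np


def ips_estimate(rewards, target_probs, logging_probs):
    """Plain inverse propensity score (IPS) estimate: mean of weighted rewards."""
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)
    return float(np.mean(weights * np.asarray(rewards, dtype=float)))


def lse_estimate(rewards, target_probs, logging_probs, lam=-1.0):
    """Log-sum-exponential estimate (assumed form, see lead-in).

    Computes (1/lam) * log(mean(exp(lam * w_i * r_i))) with lam < 0.
    """
    if lam >= 0:
        raise ValueError("lam must be negative in this sketch")
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)
    z = lam * weights * np.asarray(rewards, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    z_max = np.max(z)
    log_mean_exp = z_max + np.log(np.mean(np.exp(z - z_max)))
    return float(log_mean_exp / lam)


# Toy logged data (hypothetical): one sample has a very small logging
# propensity, so its importance weight is 500.
rewards = np.array([1.0, 1.0, 1.0, 1.0])
target_probs = np.array([0.5, 0.5, 0.5, 0.5])
logging_probs = np.array([0.5, 0.5, 0.5, 0.001])

ips_value = ips_estimate(rewards, target_probs, logging_probs)
lse_value = lse_estimate(rewards, target_probs, logging_probs, lam=-1.0)
print(f"IPS: {ips_value:.4f}  LSE: {lse_value:.4f}")
```

Under this assumed form, the single extreme weight dominates the IPS average but contributes almost nothing to the LSE value when rewards are nonnegative, and by Jensen's inequality the LSE value never exceeds the IPS value. This illustrates the variance-reduction behavior described above. It does not reproduce the paper's experiments.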
Related papers
- Off-policy Evaluation And Learning From Logged Bandit Feedback: Error Reduction Via Surrogate Policy (2018)
- Logarithmic Smoothing For Pessimistic Off-policy Evaluation, Selection And Learning (2024)
- Intrinsically Efficient, Stable, And Bounded Off-policy Evaluation For Reinforcement Learning (2019)
- DOLCE: Decomposing Off-policy Evaluation/learning Into Lagged And Current Effects (2025)
- Doubly Robust Interval Estimation For Optimal Policy Evaluation In Online Learning (2021)
- More Efficient Off-policy Evaluation Through Regularized Targeted Learning (2019)
- Adaptive Doubly Robust Estimator From Non-stationary Logging Policy Under A Convergence Of Average Probability (2021)
- Logarithmic Smoothing For Adaptive Pac-bayesian Off-policy Learning (2025)