Statistically Efficient Variance Reduction With Double Policy Estimation For Off-policy Evaluation In Sequence-modeled Reinforcement Learning
2023 Β· Hanhan Zhou, Tian Lan, Vaneet Aggarwal
Abstract
Offline reinforcement learning aims to utilize datasets of previously gathered environment-action interaction records to learn a policy without access to the real environment. Recent work has shown that offline reinforcement learning can be formulated as a sequence modeling problem and solved via supervised learning with approaches such as decision transformer. While these sequence-based methods achieve competitive results over return-to-go methods, especially on tasks that require longer episodes or with scarce rewards, importance sampling is not considered to correct the policy bias when dealing with off-policy data, mainly due to the absence of behavior policy and the use of deterministic evaluation policies. To this end, we propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation (DPE) in a unified framework with statistically proven properties on variance reduction. We validate our method in multiple tasks
Authors
(none)
Tags
Stats
Related papers
- Near-optimal Offline Reinforcement Learning Via Double Variance Reduction (2021)0.00
- Doubly Optimal Policy Evaluation For Reinforcement Learning (2024)0.00
- Double Reinforcement Learning For Efficient Off-policy Evaluation In Markov Decision Processes (2019)0.00
- Behaviour Policy Optimization: Provably Lower Variance Return Estimates For Off-policy Reinforcement Learning (2025)0.00
- Near-optimal Provable Uniform Convergence In Offline Policy Evaluation For Reinforcement Learning (2020)0.00
- Offline Policy Evaluation For Reinforcement Learning With Adaptively Collected Data (2023)0.00
- Doubly Robust Off-policy Value And Gradient Estimation For Deterministic Policies (2020)0.00
- Off-policy Evaluation In Doubly Inhomogeneous Environments (2023)7.16