Target Networks And Over-parameterization Stabilize Off-policy Bootstrapping With Function Approximation
2024 · Fengdi Che, Chenjun Xiao, Jincheng Mei, et al.
Abstract
We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird's counterexample and a Four-room task. Furthermore, we explore the control setting, demonst
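A minimal sketch of the setting described in the abstract, not the authors' algorithm: temporal difference estimation with a frozen target weight vector and an over-parameterized linear model (more features than transitions). The function name `target_td_min_norm`, the random features, the three-step episode, and the use of a minimum-norm least-squares fit for each target update are all illustrative assumptions. The batch is one complete episodic trajectory, the case the abstract says satisfies the convergence condition naturally.

```python
import numpy as np


def target_td_min_norm(phi, phi_next, rewards, gamma, outer_steps=50):
    """Hypothetical helper: repeatedly fit bootstrapped targets built from a
    frozen target weight vector, then copy the fitted weights into the target.

    phi      : (n, d) features of visited states, with d >= n assumed
    phi_next : (n, d) features of successor states (zero rows at termination)
    rewards  : (n,) observed rewards
    """
    n, d = phi.shape
    theta_target = np.zeros(d)
    # With d >= n and full row rank, the pseudoinverse gives the
    # minimum-norm weights that interpolate any set of n targets.
    phi_pinv = np.linalg.pinv(phi)
    for _ in range(outer_steps):
        # Bootstrapped targets use the frozen target weights only.
        targets = rewards + gamma * phi_next @ theta_target
        theta = phi_pinv @ targets
        # Target update: copy the freshly fitted weights.
        theta_target = theta
    return theta_target


# One complete episode s0 -> s1 -> s2 -> terminal, reward 1 per step.
rng = np.random.default_rng(0)
n, d = 3, 8  # over-parameterized: more features than transitions
phi = rng.normal(size=(n, d))
# Successor of s2 is terminal, so its feature row is all zeros.
phi_next = np.vstack([phi[1:], np.zeros((1, d))])
rewards = np.ones(n)

theta = target_td_min_norm(phi, phi_next, rewards, gamma=0.5)
values = phi @ theta
print(values)
```

In this toy episode the fitted values at the visited states match the discounted returns 1.75, 1.5, and 1.0. The iteration settles after a few target updates because, on a complete trajectory with interpolating features, each update propagates the return one step backward along the episode, regardless of how the behaviour data were distributed.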
Related papers
- A Generalized Bootstrap Target For Value-learning, Efficiently Combining Value And Feature Predictions (2022)
- Breaking The Deadly Triad With A Target Network (2021)
- Implicit Under-parameterization Inhibits Data-efficient Deep Reinforcement Learning (2020)
- Statistical Bootstrapping For Uncertainty Estimation In Off-policy Evaluation (2020)
- Minimax-optimal Off-policy Evaluation With Linear Function Approximation (2020)
- Bootstrapping With Models: Confidence Intervals For Off-policy Evaluation (2016)
- A Unifying View Of Linear Function Approximation In Off-policy RL Through Matrix Splitting And Preconditioning (2025)
- Accelerated And Instance-optimal Policy Evaluation With Linear Function Approximation (2021)