A Unifying View Of Linear Function Approximation In Off-policy RL Through Matrix Splitting And Preconditioning
2025 · Zechen Wu, Amy Greenwald, Ronald Parr
Abstract
In off-policy policy evaluation (OPE) tasks within reinforcement learning, Temporal Difference (TD) learning and Fitted Q-Iteration (FQI) have traditionally been viewed as differing in the number of updates toward the target value function: TD makes one update, FQI makes an infinite number, and Partial Fitted Q-Iteration (PFQI) performs a finite number. We show that this view is not accurate, and provide a new mathematical perspective under linear value function approximation that unifies these methods as a single iterative method solving the same linear system, but using different matrix splitting schemes and preconditioners. We show that increasing the number of updates under the same target value function, i.e., the target network technique, is a transition from using a constant preconditioner to using a data-feature adaptive preconditioner. This elucidates, for the first time, why TD convergence does not necessarily imply FQI convergence, and establishes tight convergence connection
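The preconditioning view described in the abstract can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the names `Phi`, `mu`, `P`, `Sigma_cov`, `Sigma_cr`, the expected (population) form of the updates, and the specific preconditioner expressions, which follow from writing out the standard expected linear TD, FQI, and PFQI updates. The sketch only checks algebraic identities on a random toy problem; it makes no claim about convergence.

```python
import numpy as np

def build_system(Phi, mu, P, r, gamma):
    """Build the shared linear system A @ theta = b (assumed notation)."""
    D = np.diag(mu)                      # state weighting from the data distribution
    Sigma_cov = Phi.T @ D @ Phi          # feature covariance
    Sigma_cr = Phi.T @ D @ P @ Phi       # cross-covariance with next-state features
    A = Sigma_cov - gamma * Sigma_cr
    b = Phi.T @ D @ r
    return A, b, Sigma_cov, Sigma_cr

def step(theta, M, A, b):
    """One preconditioned iteration: theta - M (A theta - b)."""
    return theta - M @ (A @ theta - b)

def td_preconditioner(Sigma_cov, alpha):
    """Constant preconditioner: alpha * I."""
    return alpha * np.eye(Sigma_cov.shape[0])

def fqi_preconditioner(Sigma_cov):
    """Data-feature adaptive preconditioner: inverse feature covariance."""
    return np.linalg.inv(Sigma_cov)

def pfqi_preconditioner(Sigma_cov, alpha, t):
    """alpha * sum_{i=0}^{t-1} (I - alpha * Sigma_cov)^i for t inner updates."""
    d = Sigma_cov.shape[0]
    shrink = np.eye(d) - alpha * Sigma_cov
    total = np.zeros((d, d))
    power = np.eye(d)
    for _ in range(t):
        total += power
        power = power @ shrink
    return alpha * total

def pfqi_inner_loop(theta, Sigma_cov, Sigma_cr, b, gamma, alpha, t):
    """Reference PFQI: t gradient steps toward a target frozen at theta."""
    target = b + gamma * Sigma_cr @ theta   # frozen "target network" term
    w = theta.copy()
    for _ in range(t):
        w = w - alpha * (Sigma_cov @ w - target)
    return w

# Hypothetical toy problem: 6 states, 3 features, random dynamics.
rng = np.random.default_rng(0)
n_states, n_features, gamma = 6, 3, 0.9
Phi = rng.normal(size=(n_states, n_features))
mu = rng.dirichlet(np.ones(n_states))            # strictly positive weights
P = rng.random((n_states, n_states))
P = P / P.sum(axis=1, keepdims=True)             # row-stochastic transitions
r = rng.normal(size=n_states)

A, b, Sigma_cov, Sigma_cr = build_system(Phi, mu, P, r, gamma)
alpha = 1.0 / np.linalg.eigvalsh(Sigma_cov).max()
theta = rng.normal(size=n_features)

td_next = step(theta, td_preconditioner(Sigma_cov, alpha), A, b)
fqi_next = step(theta, fqi_preconditioner(Sigma_cov), A, b)
pfqi_next = step(theta, pfqi_preconditioner(Sigma_cov, alpha, 7), A, b)
```

In this sketch, `pfqi_preconditioner` with `t = 1` reduces to the TD preconditioner, and the preconditioned PFQI step agrees with the explicit inner loop in `pfqi_inner_loop`, which is one way to read the abstract's statement that the three methods iterate on the same linear system with different preconditioners.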
Related papers
- The Role Of Lookahead And Approximate Policy Evaluation In Reinforcement Learning With Linear Value Function Approximation (2021)
- Variance-aware Off-policy Evaluation With Linear Function Approximation (2021)
- Off-policy Fitted Q-evaluation With Differentiable Function Approximators: Z-estimation And Inference Theory (2022)
- Minimax-optimal Off-policy Evaluation With Linear Function Approximation (2020)
- Accelerated And Instance-optimal Policy Evaluation With Linear Function Approximation (2021)
- A Finite Time Analysis Of Temporal Difference Learning With Linear Function Approximation (2018)
- Adaptive Temporal Difference Learning With Linear Function Approximation (2020)
- Provably Efficient Reinforcement Learning With Linear Function Approximation (2019)