Trajectory Data Suffices For Statistically Efficient Learning In Offline RL With Linear \(q^\pi\)-realizability And Concentrability
2024 · Volodymyr Tkachuk, Gellért Weisz, Csaba Szepesvári
Abstract
We consider offline reinforcement learning (RL) in \(H\)-horizon Markov decision processes (MDPs) under the linear \(q^\pi\)-realizability assumption, where the action-value function of every policy is linear with respect to a given \(d\)-dimensional feature function. The hope in this setting is that learning a good policy will be possible without requiring a sample size that scales with the number of states in the MDP. Foster et al. [2021] have shown this to be impossible even under \(\textit{concentrability}\), a data coverage assumption where a coefficient \(C_{\text{conc}}\) bounds the extent to which the state-action distribution of any policy can veer off the data distribution. However, the data in this previous work was in the form of a sequence of individual transitions. This leaves open the question of whether the negative result mentioned could be overcome if the data was composed of sequences of full trajectories. In this work we answer this question positively by proving
Related papers
- Offline Reinforcement Learning: Role Of State Aggregation And Trajectory Data (2024)
- Online RL In Linearly \(q^\pi\)-realizable MDPs Is As Easy As In Linear MDPs If You Learn What To Ignore (2023)
- Offline Reinforcement Learning With Realizability And Single-policy Concentrability (2022)
- Offline Safe Reinforcement Learning Using Trajectory Classification (2024)
- Harnessing Mixed Offline Reinforcement Learning Datasets Via Trajectory Weighting (2023)
- Sample-efficient Reinforcement Learning Is Feasible For Linearly Realizable MDPs With Limited Revisiting (2021)
- What Are The Statistical Limits Of Offline RL With Linear Function Approximation? (2020)
- Provably Efficient Offline Reinforcement Learning With Trajectory-wise Reward (2022)