Abstract

We study the setting of *performative reinforcement learning* where the deployed policy affects both the reward, and the transition of the underlying Markov decision process. Prior work~\parencite\{MTR23\} has addressed this problem under the tabular setting and established last-iterate convergence of repeated retraining with iteration complexity explicitly depending on the number of states. In this work, we generalize the results to *linear Markov decision processes* which is the primary theoretical model of large-scale MDPs. The main challenge with linear MDP is that the regularized objective is no longer strongly convex and we want a bound that scales with the dimension of the features, rather than states which can be infinite. Our first result shows that repeatedly optimizing a regularized objective converges to a *performatively stable policy*. In the absence of strong convexity, our analysis leverages a new recurrence relation that uses a specific linear combination of optimal du

Authors

(none)

Tags

  • Uncategorized

Stats

Related papers