Model-based Reinforcement Learning With Double Oracle Efficiency In Policy Optimization And Offline Estimation

Abstract

arXiv:2605.00393v1

Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. While recent advances have explored offline oracle-efficient algorithms, their computational complexity typically scales with the cardinality of the state and action spaces, rendering them intractable for large-scale or continuous environments. In this paper, we address this fundamental limitation by studying offline oracle-efficient episodic RL through the lens of log-barrier and log-determinant regularization. Specifically, for tabular Markov Decision Processes (MDPs), we propose a novel algorithm that achieves the optimal \(\tilde{O}(\sqrt{T})\) regret bound while requiring only \(O(H \log\log T)\) calls to both the offline statistical estimation and planning oracles when \(T\) is known and \(O(H \log T)\) call
