Occupancy Information Ratio: Infinite-horizon, Information-directed, Parameterized Policy Search
2022 · Wesley A. Suttle, Alec Koppel, Ji Liu
Abstract
In this work, we propose an information-directed objective for infinite-horizon reinforcement learning (RL), called the occupancy information ratio (OIR), inspired by the information ratio objectives used in previous information-directed sampling schemes for multi-armed bandits and Markov decision processes, as well as by recent advances in general utility RL. The OIR, the ratio of the average cost of a policy to the entropy of its induced state occupancy measure, enjoys rich underlying structure and presents an objective to which scalable, model-free policy search methods naturally apply. Specifically, we show, by leveraging connections between quasiconcave optimization and the linear programming theory for Markov decision processes, that the OIR problem can be transformed and solved via concave programming methods when the underlying model is known. Since model knowledge is typically lacking in practice, we lay the foundations for model-free OIR policy search methods by es
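To make the objective concrete, the following is a minimal sketch of how the OIR of a fixed policy could be evaluated when the model is known. It is not the authors' implementation: the function names, the tabular arrays `P`, `c`, and `pi`, and the toy two-state MDP are all assumptions introduced for illustration, and the sketch further assumes that the induced Markov chain has a unique stationary distribution with nonzero entropy.

```python
import numpy as np


def stationary_distribution(P_pi):
    """Stationary distribution d of a row-stochastic matrix P_pi.

    Solves d^T P_pi = d^T together with sum(d) = 1 by least squares.
    Assumes (hypothetically) that the chain has a unique stationary distribution.
    """
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    d = np.clip(d, 0.0, None)  # remove tiny negative values from round-off
    return d / d.sum()


def occupancy_information_ratio(P, c, pi):
    """Return (OIR, average cost, occupancy entropy) for a tabular policy.

    P  : transition tensor, shape (S, A, S)   (illustrative name)
    c  : cost matrix, shape (S, A)            (illustrative name)
    pi : policy, shape (S, A), rows sum to 1  (illustrative name)
    """
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum("sa,sat->st", pi, P)
    d = stationary_distribution(P_pi)
    # Long-run average cost under the state occupancy measure d.
    avg_cost = float(np.sum(d[:, None] * pi * c))
    # Shannon entropy of the state occupancy measure (0 * log 0 treated as 0).
    nz = d > 0
    entropy = float(-np.sum(d[nz] * np.log(d[nz])))
    return avg_cost / entropy, avg_cost, entropy


# Toy MDP (hypothetical): action 0 stays in place, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
c = np.array([[1.0, 2.0], [3.0, 4.0]])
pi = np.full((2, 2), 0.5)  # uniform random policy

oir, avg_cost, entropy = occupancy_information_ratio(P, c, pi)
print(oir, avg_cost, entropy)
```

Under the uniform policy, the toy chain visits both states equally often, so the occupancy entropy is log 2 and the average cost is 2.5. A policy that lowers cost while keeping the state occupancy spread out would achieve a smaller ratio.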
Related papers
- Inverse Reinforcement Learning With Explicit Policy Estimates (2021) 2.26
- OPIRL: Sample Efficient Off-policy Inverse Reinforcement Learning Via Distribution Matching (2021) 0.00
- Actor-critic Policy Optimization In Partially Observable Multiagent Environments (2018) 0.00
- Off-policy Evaluation In Infinite-horizon Reinforcement Learning With Latent Confounders (2020) 0.00
- Offline RL With No OOD Actions: In-sample Learning Via Implicit Value Regularization (2023) 0.00
- Simplifying Model-based RL: Learning Representations, Latent-space Models, And Policies With One Objective (2022) 0.00
- Efficiently Breaking The Curse Of Horizon In Off-policy Evaluation With Double Reinforcement Learning (2019) 10.21
- Task-guided Inverse Reinforcement Learning Under Partial Information (2021) 0.00