Policy Dispersion In Non-Markovian Environment
2023 · Bohao Qu, Xiaofeng Cao, Jielong Yang, et al.
Abstract
A Markov Decision Process (MDP) provides a mathematical framework for formulating the learning processes of agents in reinforcement learning. An MDP is limited by the Markovian assumption that a reward depends only on the immediate state and action. However, a reward sometimes depends on the history of states and actions, in which case the decision process takes place in a non-Markovian environment. In such environments, agents receive rewards sparsely, via temporally extended behaviors, and the learned policies may be similar. Agents with similar policies generally overfit to the given task and cannot quickly adapt to perturbations of the environment. To resolve this problem, this paper tries to learn diverse policies from the history of state-action pairs in a non-Markovian environment, for which a policy dispersion scheme is designed to seek diverse policy representations. Specifically, we first adopt a transformer-based method to learn policy embeddings. Then, we st
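The distinction the abstract draws between a Markovian and a non-Markovian reward can be illustrated with a minimal sketch. This is not the paper's environment or method: the integer states, the "key" state 1, the "door" state 3, and both function names are hypothetical, chosen only to show a reward that cannot be computed from the current state-action pair alone.

```python
from typing import List, Tuple

# Hypothetical toy setup: states and actions are plain integers.
Step = Tuple[int, int]  # (state, action)

KEY_STATE = 1    # assumed "key" location
DOOR_STATE = 3   # assumed "door" location
OPEN_ACTION = 1  # assumed "open" action


def markovian_reward(state: int, action: int) -> float:
    """Reward depends only on the immediate state and action."""
    return 1.0 if (state == DOOR_STATE and action == OPEN_ACTION) else 0.0


def non_markovian_reward(history: List[Step]) -> float:
    """Reward depends on the history of state-action pairs.

    Opening the door pays off only if the key state was visited
    at some earlier step in the same trajectory.
    """
    if not history:
        return 0.0
    state, action = history[-1]
    visited_key = any(s == KEY_STATE for s, _ in history[:-1])
    at_door = state == DOOR_STATE and action == OPEN_ACTION
    return 1.0 if (at_door and visited_key) else 0.0


# Two trajectories that end in the same (state, action) pair.
with_key = [(0, 0), (1, 0), (2, 0), (3, 1)]
without_key = [(0, 0), (2, 0), (2, 0), (3, 1)]
```

Both trajectories end in the same state-action pair, so `markovian_reward` cannot tell them apart, while `non_markovian_reward` pays only for `with_key`. The reward is also sparse: it is nonzero only at the end of a temporally extended behavior.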
Related papers
- Learning Non-Markovian Reward Models In MDPs (2020) 0.00
- Intrinsically Motivated Hierarchical Policy Learning In Multi-objective Markov Decision Processes (2023) 4.52
- Policy Learning For Robust Markov Decision Process With A Mismatched Generative Model (2022) 0.00
- Diverse Policies Converge In Reward-free Markov Decision Processes (2023) 0.00
- Configurable Markov Decision Processes (2018) 0.00
- Efficient Policy Learning For Non-stationary MDPs Under Adversarial Manipulation (2019) 0.00
- Entropic Regularization Of Markov Decision Processes (2019) 6.77
- Multi-timescale Ensemble Q-learning For Markov Decision Process Policy Optimization (2024) 6.34