Policy Gradient Algorithms With Monte Carlo Tree Learning For Non-Markov Decision Processes
2022 · Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, et al.
Abstract
Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent. While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues. As another successful RL approach, algorithms based on Monte Carlo Tree Search (MCTS), which include AlphaZero, have obtained groundbreaking results, especially in the game-playing domain. They are also effective when applied to non-Markov decision processes. However, the standard MCTS is a method for decision-time planning, which differs from the online RL setting. In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online RL setups. We then explore a combined policy approach of PG and MCTL to leverage their strengths. We derive conditions for asymptotic convergence with the results of a two-timescale stochastic approximation and propose an algorithm that satisfies these conditions and conv
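To make the first sentence of the abstract concrete, here is a minimal sketch of gradient ascent on an expected return for a softmax policy. It is not the algorithm proposed in the paper: it assumes a stateless three-armed bandit with known mean rewards, so the gradient can be computed exactly and nothing about tree learning or non-Markov structure is modeled. The reward values, step size, number of steps, and all function names are illustrative assumptions.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of action preferences."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_return(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a) for a stateless bandit."""
    pi = softmax(theta)
    return sum(p * r for p, r in zip(pi, rewards))


def policy_gradient(theta, rewards):
    """Exact gradient of J for a softmax policy: dJ/dtheta_a = pi(a) * (r(a) - J)."""
    pi = softmax(theta)
    j = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - j) for p, r in zip(pi, rewards)]


def gradient_ascent(theta, rewards, step_size=0.5, num_steps=2000):
    """Repeatedly move the parameters in the direction of the gradient of J."""
    theta = list(theta)
    for _ in range(num_steps):
        grad = policy_gradient(theta, rewards)
        theta = [t + step_size * g for t, g in zip(theta, grad)]
    return theta


# Assumed toy problem: arm 1 has the highest mean reward.
rewards = [0.2, 1.0, 0.5]
theta0 = [0.0, 0.0, 0.0]          # uniform initial policy
theta = gradient_ascent(theta0, rewards)

print("initial return:", expected_return(theta0, rewards))
print("final return:  ", expected_return(theta, rewards))
print("final policy:  ", softmax(theta))
```

In practice the gradient is estimated from sampled trajectories rather than computed in closed form; the exact version is used here only so that the example is deterministic.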
Related papers
- Policy Gradient Search: Online Planning And Expert Iteration Without Search Trees (2019)
- PC-PG: Policy Cover Directed Exploration For Provable Policy Gradient Learning (2020)
- Learning Optimal Deterministic Policies With Stochastic Policy Gradients (2024)
- Learning Deterministic Policies With Policy Gradients In Constrained Markov Decision Processes (2025)
- SoftTreeMax: Policy Gradient With Tree Search (2022)
- Smoothing Policies And Safe Policy Gradients (2019)
- Learning Policies From Self-play With Policy Gradients And MCTS Value Estimates (2019)
- Mixed Policy Gradient: Off-policy Reinforcement Learning Driven Jointly By Data And Model (2021)