SOAP-RL: Sequential Option Advantage Propagation For Reinforcement Learning In POMDP Environments
2024 · Shu Ishida, João F. Henriques
Abstract
This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options. One view of options is as temporally extended actions, which can be realized as a memory that allows the agent to retain historical information beyond the policy's context window. While option assignment could be handled using heuristics and hand-crafted objectives, learning temporally consistent options and associated sub-policies without explicit supervision is a challenge. Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem. PPOEM applies the forward-backward algorithm (for Hidden Markov Models) to optimize the expected returns for an option-augmented policy. However, this learning approach is unstable during on-policy rollouts. It is also unsuited to learning causal policies without knowledge of future trajectories, since option assignments are optimized for offline sequences where the entire episode
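To make the forward-backward step mentioned in the abstract concrete, the following is a minimal sketch of the textbook algorithm for a discrete Hidden Markov Model, with the hidden state read as the option. It is not the PPOEM implementation: the function name `forward_backward` and the inputs `init`, `trans`, and `emit_lik` are hypothetical names chosen for illustration, and the numbers in the usage example are made up. The sketch returns both the filtered posterior (which uses only past observations) and the smoothed posterior (which also uses future observations), the distinction the abstract raises when it contrasts causal policies with offline sequences.

```python
from typing import List, Sequence, Tuple


def forward_backward(
    init: Sequence[float],
    trans: Sequence[Sequence[float]],
    emit_lik: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Scaled forward-backward pass for a discrete HMM (illustrative only).

    init[z]        : prior probability of hidden state (option) z at t = 0
    trans[i][j]    : probability of moving from state i to state j
    emit_lik[t][z] : likelihood of the observation at time t under state z

    Returns (alpha, gamma), where
    alpha[t][z] = p(z_t = z | observations up to t)   (filtered, causal)
    gamma[t][z] = p(z_t = z | all observations)       (smoothed, offline)
    """
    T, K = len(emit_lik), len(init)
    alpha = [[0.0] * K for _ in range(T)]
    scale = [0.0] * T

    # Forward pass: propagate the belief, weight by the likelihood, normalize.
    for t in range(T):
        for j in range(K):
            if t == 0:
                prior = init[j]
            else:
                prior = sum(alpha[t - 1][i] * trans[i][j] for i in range(K))
            alpha[t][j] = prior * emit_lik[t][j]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass: fold in evidence from future observations,
    # reusing the forward scaling factors for numerical stability.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(
                trans[i][j] * emit_lik[t + 1][j] * beta[t + 1][j] for j in range(K)
            ) / scale[t + 1]

    gamma = [[alpha[t][z] * beta[t][z] for z in range(K)] for t in range(T)]
    return alpha, gamma


# Usage: two "sticky" options; only the last observation is informative.
alpha, gamma = forward_backward(
    init=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    emit_lik=[[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]],
)
print("filtered t=0:", alpha[0])
print("smoothed t=0:", gamma[0])
```

In this toy example the filtered belief at the first step is uniform, while the smoothed belief leans toward the first option because it sees the informative final observation. A policy acting online has access only to the filtered quantity.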