How Memory Architecture Affects Learning In A Simple POMDP: The Two-hypothesis Testing Problem
2021 · Mario Geiger, Christophe Eloy, Matthieu Wyart
Abstract
Reinforcement learning is generally difficult for partially observable Markov decision processes (POMDPs), which arise when the agent's observations are partial or noisy. To achieve good performance in POMDPs, one strategy is to endow the agent with a finite memory whose update is governed by the policy. However, policy optimization is non-convex in that case and can lead to poor training performance from a random initialization. Performance can be improved empirically by constraining the memory architecture, thereby sacrificing optimality to facilitate training. Here we study this trade-off in a two-hypothesis testing problem, akin to the two-arm bandit problem. We compare two extreme cases: (i) a random access memory, where any transition between \(M\) memory states is allowed, and (ii) a fixed memory, where the agent can access its last \(m\) actions and rewards. For (i), the probability \(q\) of playing the worse arm is known to be exponentially small in \(M\) for the optimal policy. Our
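To make case (ii) concrete, here is a minimal Python sketch of a fixed-memory agent that sees only its last \(m\) actions and rewards. It is an illustration under stated assumptions, not the authors' method: the Bernoulli reward probabilities `p_good` and `p_bad`, the function name `simulate_fixed_memory`, and the hand-written decision rule are all hypothetical, whereas the paper studies policies that are learned. The function returns an empirical estimate of \(q\), the fraction of plays on the worse arm.

```python
import random
from collections import deque


def simulate_fixed_memory(m=3, p_good=0.7, p_bad=0.3, steps=20000, seed=0):
    """Estimate q, the fraction of plays on the worse arm.

    Assumptions (not from the paper): rewards are Bernoulli with
    probabilities p_good and p_bad, and the policy is a fixed heuristic.
    """
    rng = random.Random(seed)
    good_arm = rng.randrange(2)   # hidden hypothesis: which arm is better
    memory = deque(maxlen=m)      # the last m (action, reward) pairs
    worse_plays = 0

    for _ in range(steps):
        # Hypothetical rule: score each arm by successes minus failures
        # among the remembered plays, then pick the higher-scoring arm.
        score = [0, 0]
        for action, reward in memory:
            score[action] += 1 if reward else -1

        if score[0] == score[1]:
            action = rng.randrange(2)   # break ties at random
        else:
            action = 0 if score[0] > score[1] else 1

        reward = rng.random() < (p_good if action == good_arm else p_bad)
        memory.append((action, reward))
        worse_plays += int(action != good_arm)

    return worse_plays / steps


if __name__ == "__main__":
    for m in (1, 3, 5):
        print(m, simulate_fixed_memory(m=m))
```

With `m=1` this rule reduces to win-stay, lose-shift: the agent repeats an action that was just rewarded and switches otherwise. Larger `m` gives the rule more evidence to weigh, but nothing here is optimized, so the numbers it produces say nothing about the optimal or trained policies discussed in the abstract.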
Related papers
- Statistical Tractability Of Off-policy Evaluation Of History-dependent Policies In POMDPs (2025)
- Sample-efficient Learning Of POMDPs With Multiple Observations In Hindsight (2023)
- Near-optimal Partially Observable Reinforcement Learning With Partial Online State Information (2023)
- Memoryless Policy Iteration For Episodic POMDPs (2025)
- The Act Of Remembering: A Study In Partially Observable Reinforcement Learning (2020)
- Reinforcement Learning In POMDPs With Memoryless Options And Option-observation Initiation Sets (2017)
- Goal-oriented Inference Of Environment From Redundant Observations (2023)
- Stable Hadamard Memory: Revitalizing Memory-augmented Agents For Reinforcement Learning (2024)