Abstract

We introduce a new recurrent agent architecture and associated auxiliary losses which improve reinforcement learning in partially observable tasks requiring long-term memory. We employ a temporal hierarchy, using a slow-ticking recurrent core to allow information to flow more easily over long time spans, and three fast-ticking recurrent cores with connections designed to create an information asymmetry. The *reaction* core incorporates new observations with input from the slow core to produce the agent's policy; the *perception* core accesses only short-term observations and informs the slow core; lastly, the *prediction* core accesses only long-term memory. An auxiliary loss regularizes policies drawn from all three cores against each other, enacting the prior that the policy should be expressible from either recent or long-term memory. We present the resulting *Perception-Prediction-Reaction* (PPR) agent and demonstrate its improved performance over a strong LSTM-agent baseline in DM
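The auxiliary loss described above can be illustrated with a minimal sketch. The abstract does not say which divergence is used or how the terms are weighted, so the pairwise KL divergence, the uniform weighting, and the names `softmax`, `kl`, and `pairwise_policy_loss` below are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl(p, q, eps=1e-8):
    """KL(p || q) over the last axis; eps guards against log(0)."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)


def pairwise_policy_loss(logits_by_core):
    """Sum KL over every ordered pair of core policies, averaged over the batch.

    logits_by_core: list of arrays of shape [batch, num_actions], one per
    fast-ticking core (hypothetically: reaction, perception, prediction).
    The choice of KL and of equal weights is an assumption.
    """
    policies = [softmax(l) for l in logits_by_core]
    total = 0.0
    for i, p in enumerate(policies):
        for j, q in enumerate(policies):
            if i != j:
                total = total + kl(p, q)
    return float(np.mean(total))
```

Under these assumptions the loss is zero when all three cores produce the same policy and grows as they disagree, which matches the stated prior that the policy should be expressible from either recent or long-term memory.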

Tags

  • Multi-Agent
