Abstract

Deep Reinforcement Learning (DRL) algorithms for continuous action spaces are known to be brittle toward hyperparameters as well as \cut\{being\}sample inefficient. Soft Actor Critic (SAC) proposes an off-policy deep actor critic algorithm within the maximum entropy RL framework which offers greater stability and empirical gains. The choice of policy distribution, a factored Gaussian, is motivated by \cut\{chosen due\}its easy re-parametrization rather than its modeling power. We introduce Normalizing Flow policies within the SAC framework that learn more expressive classes of policies than simple factored Gaussians. \cut\{We also present a series of stabilization tricks that enable effective training of these policies in the RL setting.\}We show empirically on continuous grid world tasks that our approach increases stability and is better suited to difficult exploration in sparse reward settings.

Authors

(none)

Tags

  • Exploration

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyward2019improving

Related papers