Actor-critic Is Implicitly Biased Towards High Entropy Optimal Policies
2021 · Yuzheng Hu, Ziwei Ji, Matus Telgarsky
Abstract
We show that the simplest actor-critic method -- a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration -- does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like \(\epsilon\)-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias is that uniform mixing assumptions on the MDP, which exist in some form in all prior work, can be dropped: the implicit regularization of the high entropy bias is enough to ensure that all chains mix and an optimal policy is reached with high probability. As auxiliary contributions, this work decouples concerns between the actor and critic by writing the actor update as an explicit mirror descent, provides tools to uniformly bound mixing times within KL balls
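As a rough illustration of the kind of method the abstract describes, the following is a minimal sketch, not the authors' exact algorithm. It runs a softmax actor and a TD(0) critic on a single trajectory with no resets, no regularization, no projections, and no \(\epsilon\)-greedy exploration. The toy MDP, the one-hot (tabular) features, the step sizes, and all names (`run_actor_critic`, `W`, `U`, `P`, `R`) are assumptions made for this example. The actor step adds the critic's estimate to the logits, which for a softmax policy corresponds to a mirror descent step under the KL divergence, \(\pi_{t+1}(\cdot \mid s) \propto \pi_t(\cdot \mid s)\exp(\eta \hat{Q}_t(s, \cdot))\).

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def run_actor_critic(num_steps=2000, n_states=4, n_actions=3, gamma=0.9,
                     actor_lr=0.05, critic_lr=0.1, seed=0):
    """Softmax actor + TD(0) critic on one trajectory (illustrative sketch)."""
    rng = np.random.default_rng(seed)

    # Hypothetical toy MDP: random transition kernel P[s, a] over next
    # states and rewards in [0, 1]. Not taken from the paper.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(size=(n_states, n_actions))

    # With one-hot state features, the linear actor and critic reduce
    # to tables indexed by (state, action).
    W = np.zeros((n_states, n_actions))  # actor logits
    U = np.zeros((n_states, n_actions))  # critic estimate of Q

    s = 0
    a = rng.choice(n_actions, p=softmax(W[s]))
    for _ in range(num_steps):
        r = R[s, a]
        s_next = rng.choice(n_states, p=P[s, a])
        a_next = rng.choice(n_actions, p=softmax(W[s_next]))

        # Critic: on-policy TD(0) update on the visited (s, a) pair.
        td_error = r + gamma * U[s_next, a_next] - U[s, a]
        U[s, a] += critic_lr * td_error

        # Actor: add the current Q estimate to the logits of the visited
        # state (a KL mirror descent step for a softmax policy).
        W[s] += actor_lr * U[s]

        # Continue along the same trajectory; the chain is never reset.
        s, a = s_next, a_next

    policy = np.apply_along_axis(softmax, 1, W)
    return policy, U


if __name__ == "__main__":
    policy, U = run_actor_critic()
    print(np.round(policy, 3))
```

The sketch only shows the shape of the updates. It makes no attempt to reproduce the paper's linear MDP setting, step-size choices, or guarantees.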
Related papers
- Single-timescale Actor-critic Provably Finds Globally Optimal Policy (2020)
- Off-policy Actor-critic In An Ensemble: Achieving Maximum General Entropy And Effective Environment Exploration In Deep Reinforcement Learning (2019)
- Soft Actor-critic: Off-policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor (2018)
- Greedy Actor-critic: A New Conditional Cross-entropy Method For Policy Improvement (2018)
- Beyond The Policy Gradient Theorem For Efficient Policy Updates In Actor-critic Algorithms (2022)
- Stochastic Actor-critic: Mitigating Overestimation Via Temporal Aleatoric Uncertainty (2026)
- Global Optimality And Finite Sample Analysis Of Softmax Off-policy Actor Critic Under State Distribution Mismatch (2021)
- Actor-critic Policy Optimization In Partially Observable Multiagent Environments (2018)