Mirror Descent Actor Critic Via Bounded Advantage Learning
2025 · Ryo Iwaki
Abstract
Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and strong theoretical guarantees, the performance of KL-entropy-regularized methods does not surpass that of a strong entropy-only-regularized method in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the actor's log-density terms in the critic's loss function, compared to a non-bounded naive instantiation. Further, we relate MDAC to Advantage Learning by recalling that the actor's log-probability is equal to the regularized advantage function in tabular cases, and theoretically discuss when and why bounding the advanta
Related papers
- Divergence-regularized Multi-agent Actor-critic (2021)
- Mutual-information Regularization In Markov Decision Processes And Actor-critic Learning (2019)
- ACE: Off-policy Actor-critic With Causality-aware Entropy Regularization (2024)
- Actor-critic Is Implicitly Biased Towards High Entropy Optimal Policies (2021)
- MARL With General Utilities Via Decentralized Shadow Reward Actor-critic (2021)
- Soft Actor-critic: Off-policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor (2018)
- Distributional Soft Actor-critic: Off-policy Reinforcement Learning For Addressing Value Estimation Errors (2020)
- Mirror Descent Policy Optimisation For Robust Constrained Markov Decision Processes (2025)