Reward-punishment Reinforcement Learning With Maximum Entropy
2024 · Jiexin Wang, Eiji Uchibe
Abstract
We introduce the "soft Deep MaxPain" (softDMP) algorithm, which integrates the optimization of long-term policy entropy into reward-punishment reinforcement learning objectives. Our motivation is to allow smoother operators for updating action values than the traditional "max" and "min" operators, with the goal of improving sample efficiency and robustness. We also address two unresolved issues from the previous Deep MaxPain method. First, we investigate how the negated ("flipped") pain-seeking sub-policy, derived from the punishment action value, collaborates with the "min" operator to effectively learn the punishment module, and how softDMP's smooth learning operator provides insights into the "flipping" trick. Second, we tackle the challenge of data collection for learning the punishment module to mitigate inconsistencies arising from the involvement of the "flipped" sub-policy (pain-avoidance sub-policy) in the unified behavior pol
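The idea of a smooth operator sitting between "max" and "min" can be illustrated with the log-sum-exp form that is common in maximum-entropy reinforcement learning. The sketch below is an assumption for illustration only: the function names `soft_max` and `soft_min` and the temperature parameter `tau` are hypothetical, and this is not necessarily the exact operator used by softDMP.

```python
import numpy as np


def soft_max(q, tau):
    """Smooth upper operator: tau * log(sum(exp(q / tau))).

    Approaches max(q) as tau -> 0+ and is never smaller than max(q).
    Subtracting the maximum first keeps the exponentials stable.
    """
    q = np.asarray(q, dtype=float)
    m = q.max()
    return m + tau * np.log(np.sum(np.exp((q - m) / tau)))


def soft_min(q, tau):
    """Smooth lower operator, obtained by negating the values,
    applying soft_max, and negating the result.

    Approaches min(q) as tau -> 0+ and is never larger than min(q).
    """
    q = np.asarray(q, dtype=float)
    return -soft_max(-q, tau)


if __name__ == "__main__":
    q_values = [1.0, 2.0, 3.0]  # hypothetical action values
    for tau in (1.0, 0.1, 0.001):
        print(tau, soft_max(q_values, tau), soft_min(q_values, tau))
```

In this sketch, the identity `soft_min(q) = -soft_max(-q)` shows one way negating the action values turns a smooth "max" into a smooth "min"; a smaller `tau` moves both operators toward their hard counterparts.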
Related papers
- Soft Policy Gradient Method For Maximum Entropy Deep Reinforcement Learning (2019) 10.85
- Do You Need The Entropy Reward (in Practice)? (2022) 0.00
- Policy Optimization Reinforcement Learning With Entropy Regularization (2019) 0.00
- Off-policy Maximum Entropy RL With Future State And Action Visitation Measures (2024) 0.00
- A Diffusion Model Framework For Maximum Entropy Reinforcement Learning (2025) 0.00
- Soft Actor-critic: Off-policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor (2018) 0.00
- Self Punishment And Reward Backfill For Deep Q-learning (2020) 7.16
- Sample-efficient Reinforcement Learning With Maximum Entropy Mellowmax Episodic Control (2019) 0.00