Maximum-entropy Exploration With Future State-action Visitation Measures
2026 · Adrien Bolland, Gaspard Lambrechts, Damien Ernst
Abstract
Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments highlight that the new objective leads to improved visitation of features within individual trajectories, in exchange for sli
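The fixed-point property stated in the abstract can be illustrated in a small tabular setting. The sketch below is not the authors' implementation. It assumes identity features (each state-action pair is its own feature), a randomly generated MDP, and one plausible form of the discounted future visitation measure, d = (1 - gamma) * P_pi + gamma * P_pi d, where P_pi is the state-action transition matrix under the policy. The names `future_visitation`, `entropy`, `P`, `pi`, and `gamma` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def future_visitation(P, pi, gamma, iters=500):
    """Iterate d <- (1 - gamma) * P_pi + gamma * P_pi @ d towards its fixed point.

    P:  transition tensor, shape (S, A, S), P[s, a, s'] = Pr(s' | s, a)
    pi: policy matrix, shape (S, A), pi[s, a] = Pr(a | s)
    Returns d of shape (S*A, S*A); row (s, a) is the assumed discounted
    distribution over state-action pairs visited after taking a in s.
    """
    S, A = pi.shape
    # State-action transition matrix: P_pi[(s,a), (s',a')] = P[s,a,s'] * pi[s',a']
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    d = np.full((S * A, S * A), 1.0 / (S * A))  # arbitrary row-stochastic start
    for _ in range(iters):
        # The operator is a gamma-contraction, so the iterates converge.
        d = (1.0 - gamma) * P_pi + gamma * P_pi @ d
    return d

def entropy(p, eps=1e-12):
    """Shannon entropy of each row of p (eps guards against log(0))."""
    return -np.sum(p * np.log(p + eps), axis=-1)

# Hypothetical random MDP and policy, for illustration only.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
pi = rng.dirichlet(np.ones(A), size=S)      # shape (S, A)

d = future_visitation(P, pi, gamma)
# Intrinsic reward for each (s, a): entropy of its future visitation measure.
intrinsic_reward = entropy(d).reshape(S, A)
```

Because each update only needs one-step transitions and the current estimate of d, the same recursion could in principle be turned into a temporal-difference style update on sampled transitions, which is the sense in which such a distribution can be estimated off-policy.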
Related papers
- Off-policy Maximum Entropy RL With Future State And Action Visitation Measures (2024)
- Maximum Entropy Exploration Without The Rollouts (2026)
- Fast Rates For Maximum Entropy Exploration (2023)
- Accelerating Reinforcement Learning With Value-conditional State Entropy Exploration (2023)
- The Importance Of Non-markovianity In Maximum State Entropy Exploration (2022)
- Provably Efficient Maximum Entropy Exploration (2018)
- Task-agnostic Exploration Via Policy Gradient Of A Non-parametric State Entropy Estimate (2020)
- Rényi State Entropy For Exploration Acceleration In Reinforcement Learning (2022)