MULEX: Disentangling Exploitation From Exploration In Deep RL
2019 · Lucas Beyer, Damien Vincent, Olivier Teboul, et al.
Abstract
An agent learning through interactions should balance its action selection process between probing the environment to discover new rewards and using the information acquired in the past to adopt useful behaviour. This trade-off is usually obtained by perturbing either the agent's actions (e.g., ε-greedy or Gibbs sampling) or the agent's parameters (e.g., NoisyNet), or by modifying the reward it receives (e.g., exploration bonus, intrinsic motivation, or hand-shaped rewards). Here, we adopt a disruptive but simple and generic perspective, where we explicitly disentangle exploration and exploitation. Different losses are optimized in parallel, one of them coming from the true objective (maximizing cumulative rewards from the environment) and others being related to exploration. Every loss is used in turn to learn a policy that generates transitions, all shared in a single replay buffer. Off-policy methods are then applied to these transitions to optimize each loss. We showcase our approa
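The scheme the abstract describes (several losses, one policy per loss acting in turn, a single shared replay buffer, off-policy updates for every loss) can be illustrated with a small tabular sketch. This is not the authors' implementation: the chain environment, the count-based novelty bonus used as the exploration reward, the per-episode alternation, and every name and hyperparameter below are assumptions chosen to keep the example short.

```python
import random
from collections import defaultdict

N_STATES = 8          # hypothetical chain: states 0..7, reward only at the right end
ACTIONS = (-1, +1)    # move left / move right
GAMMA, LR = 0.95, 0.5  # assumed discount and step size

def step(state, action):
    """Toy chain dynamics: returns (next_state, extrinsic_reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def exploit_reward(t, counts):
    # True objective: the reward provided by the environment.
    return t["r_ext"]

def explore_reward(t, counts):
    # Assumed exploration objective: a count-based novelty bonus.
    return 1.0 / (counts[t["s2"]] ** 0.5)

def greedy(q, s, rng):
    # Greedy action with random tie-breaking.
    best = max(q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(s, a)] == best])

def train(episodes=200, max_steps=30, batch=32, seed=0):
    rng = random.Random(seed)
    rewards = {"exploit": exploit_reward, "explore": explore_reward}
    q = {name: defaultdict(float) for name in rewards}  # one Q-table per loss
    counts = defaultdict(int)                           # state visit counts
    buffer = []                                         # single shared replay buffer
    names = list(rewards)
    for ep in range(episodes):
        actor = names[ep % len(names)]  # the policies take turns generating data
        s = 0
        for _ in range(max_steps):
            a = greedy(q[actor], s, rng)
            s2, r, done = step(s, a)
            counts[s2] += 1
            buffer.append({"s": s, "a": a, "r_ext": r, "s2": s2, "done": done})
            s = s2
            # Off-policy Q-learning update of every head from the shared buffer,
            # regardless of which policy produced each transition.
            for t in rng.sample(buffer, min(batch, len(buffer))):
                for name, reward_fn in rewards.items():
                    target = reward_fn(t, counts)
                    if not t["done"]:
                        target += GAMMA * max(q[name][(t["s2"], b)] for b in ACTIONS)
                    key = (t["s"], t["a"])
                    q[name][key] += LR * (target - q[name][key])
            if done:
                break
    return q, buffer
```

Calling `q, buffer = train()` returns one Q-table per objective plus the shared buffer. The point of the sketch is that the exploitation head never sees the exploration bonus in its own targets, yet it still learns from transitions gathered by the exploration policy.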
Related papers
- Exploitation Is All You Need... For Exploration (2025)
- Decoupling Exploration And Exploitation For Meta-reinforcement Learning Without Sacrifices (2020)
- First-explore, Then Exploit: Meta-learning To Solve Hard Exploration-exploitation Trade-offs (2023)
- Efficient Reinforcement Learning Via Decoupling Exploration And Utilization (2023)
- The Exploration-exploitation Dilemma Revisited: An Entropy Perspective (2024)
- Learn The Ropes, Then Trust The Wins: Self-imitation With Progressive Exploration For Agentic Reinforcement Learning (2025)
- Never Give Up: Learning Directed Exploration Strategies (2020)
- Learning Off-policy With Model-based Intrinsic Motivation For Active Online Exploration (2024)