Careful At Estimation And Bold At Exploration
2023 · Xing Chen, Yijun Liu, Zhaogeng Liu, et al.
Abstract
Exploration strategies in continuous action spaces are often heuristic because the action set is infinite, and such methods do not yield general conclusions. Prior work has shown that policy-based exploration benefits continuous action spaces in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues, aimless exploration and policy divergence, and the policy gradient used for exploration is only sometimes helpful because of inaccurate estimation. Building on the double-Q function framework, we introduce a novel exploration strategy, separate from the policy gradient, to mitigate these issues. We first propose the greedy Q softmax update schema for the Q value update. The expected Q value is computed as a weighted sum of the conservative Q value over actions, where each weight is the corresponding greedy Q value. Greedy Q takes the maximum value of the two Q functions, and conservative Q takes the minimum value of the two d
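The weighted-sum idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a finite set of sampled actions standing in for the continuous action space, assumes the weights are obtained by a softmax over the greedy Q values (the abstract only says the weight "is the corresponding greedy Q value"), and introduces a hypothetical `temperature` parameter and function name for illustration.

```python
import math


def greedy_q_softmax_value(q1, q2, temperature=1.0):
    """Sketch of a softmax-weighted expected Q value over sampled actions.

    q1, q2: Q estimates from the two critics for the same sampled actions.
    temperature: hypothetical softmax temperature (an assumption, not from the paper).
    Returns (expected_value, weights).
    """
    if not q1 or len(q1) != len(q2):
        raise ValueError("q1 and q2 must be non-empty and of equal length")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Greedy Q: element-wise maximum of the two critics.
    greedy = [max(a, b) for a, b in zip(q1, q2)]
    # Conservative Q: element-wise minimum of the two critics.
    conservative = [min(a, b) for a, b in zip(q1, q2)]

    # Numerically stable softmax over the greedy Q values (assumed normalization).
    shift = max(greedy)
    exps = [math.exp((g - shift) / temperature) for g in greedy]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Expected Q: conservative values weighted by the greedy-Q softmax.
    expected = sum(w * c for w, c in zip(weights, conservative))
    return expected, weights
```

In this sketch, actions that either critic rates highly receive more weight, while the value that is actually averaged comes from the more pessimistic critic, which is one way to read the "careful at estimation, bold at exploration" framing of the title.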
Related papers
- Sampling Efficient Deep Reinforcement Learning Through Preference-guided Stochastic Exploration (2022)
- Exploration Conscious Reinforcement Learning Revisited (2018)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Curious Explorer: A Provable Exploration Strategy In Policy Learning (2021)
- The Uncertainty Bellman Equation And Exploration (2017)
- Uncertainty Quantification And Exploration For Reinforcement Learning (2019)
- Extremum-seeking Action Selection For Accelerating Policy Optimization (2024)
- Centralized Cooperative Exploration Policy For Continuous Control Tasks (2023)