Zeroth-order Supervised Policy Improvement
2020 · Hao Sun, Ziping Xu, Yuhang Song, et al.
Abstract
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms exploit the learned value function only locally, through first-order updates, which limits their sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function \(Q\) globally while preserving the local exploitation of PG methods, based on zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently computing the argmax in a continuous action space: it finds the max-valued action within a small number of samples. The policy learning of ZOSPI has two steps. First, it samples actions and evaluates them with a learned value estimator. Second, it learns to perform the action with the highest value through supervised learning. We further demonstrate that such a supervised learning framework can learn multi-m
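The two-step policy update described in the abstract (sample actions and score them with a value estimator, then regress the policy onto the best-scoring action) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: `toy_q` is a hand-written stand-in for a learned critic, the policy is a one-parameter linear map, and the candidate counts, noise scale, and learning rate are arbitrary.

```python
import numpy as np


def toy_q(states, actions):
    # Hypothetical stand-in for a learned critic (assumption, not from the paper).
    # Its maximizing action for state s is a* = 0.5 * s.
    return -(actions - 0.5 * states) ** 2


def zospi_policy_step(w, states, rng, n_global=16, n_local=16,
                      sigma=0.1, lr=0.5, low=-1.0, high=1.0):
    """One zeroth-order supervised update for a linear policy a = w * s."""
    batch = states.size
    current = np.clip(w * states, low, high)

    # Step 1a: global candidates, drawn uniformly over the whole action range.
    global_c = rng.uniform(low, high, size=(batch, n_global))
    # Step 1b: local candidates, Gaussian perturbations of the current action.
    noise = sigma * rng.standard_normal((batch, n_local))
    local_c = np.clip(current[:, None] + noise, low, high)

    # Score every candidate (including the current action) with the critic
    # and keep the highest-valued one per state. No critic gradient is used.
    candidates = np.concatenate([current[:, None], global_c, local_c], axis=1)
    values = toy_q(states[:, None], candidates)
    targets = candidates[np.arange(batch), values.argmax(axis=1)]

    # Step 2: supervised regression of the policy output onto the targets
    # (one gradient step on the mean squared error).
    grad = np.mean(2.0 * (w * states - targets) * states)
    return w - lr * grad


rng = np.random.default_rng(0)
w = -1.0  # deliberately poor initial policy
for _ in range(200):
    states = rng.uniform(-1.0, 1.0, size=64)
    w = zospi_policy_step(w, states, rng)

print(f"learned policy weight: {w:.3f}")
```

In this toy setting the weight should drift toward 0.5, the maximizer of `toy_q`, even though the update never differentiates the critic. The uniform candidates are what allow a jump to a distant high-value action, while the perturbed candidates refine the current one.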
Related papers
- Zeroth-order Deterministic Policy Gradient (2020)
- Learning Sampling Policy For Faster Derivative Free Optimization (2021)
- Zeroth-order Optimization Is Secretly Single-step Policy Optimization (2025)
- Proximal Policy Optimization Algorithms (2017)
- Efficiently Escaping Saddle Points For Policy Optimization (2023)
- Relative Entropy Pathwise Policy Optimization (2025)
- PC-PG: Policy Cover Directed Exploration For Provable Policy Gradient Learning (2020)
- Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference (2024)