Learning Sampling Policy For Faster Derivative Free Optimization
2021 · Zhou Zhai, Bin Gu, Heng Huang
Abstract
Zeroth-order (ZO, also known as derivative-free) methods, which estimate the gradient using only two function evaluations, have attracted much attention recently because of their broad applications in the machine learning community. The two function evaluations are normally generated with random perturbations drawn from a standard Gaussian distribution. To speed up ZO methods, many techniques, such as variance-reduced stochastic ZO gradients and learning an adaptive Gaussian distribution, have recently been proposed to reduce the variance of ZO gradients. However, it remains an open problem whether there is room to further improve the convergence of ZO methods. To explore this problem, we propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling. To find the optimal policy, an actor-critic RL algorithm called deep deterministic policy gradient (DDPG) with two neu
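For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of a standard two-point ZO gradient estimator with Gaussian perturbations. It is not the ZO-RL method proposed in the paper: the perturbations here are drawn at random, which is exactly the step ZO-RL replaces with a learned sampling policy. The function name `zo_gradient`, the smoothing parameter `mu`, and the averaging over `num_samples` directions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, num_samples=1, rng=None):
    """Two-point zeroth-order gradient estimate (illustrative sketch).

    Each sample uses two function evaluations, f(x + mu * u) and f(x),
    where u is drawn from a standard Gaussian distribution.
    The names and defaults here are assumptions for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    fx = f(x)  # the unperturbed evaluation is shared across samples
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)          # random perturbation direction
        grad += (f(x + mu * u) - fx) / mu * u     # finite difference along u
    return grad / num_samples

# Usage: estimate the gradient of f(x) = ||x||^2, whose true gradient is 2x.
f = lambda x: float(np.sum(x ** 2))
x0 = np.array([1.0, -2.0, 0.5])
g = zo_gradient(f, x0, mu=1e-4, num_samples=1000, rng=np.random.default_rng(0))
```

A single sample gives a noisy estimate; averaging over more directions lowers the variance at the cost of more function evaluations, which is the trade-off that variance-reduction and learned-sampling approaches aim to improve.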
Related papers
- Zeroth-order Optimization Is Secretly Single-step Policy Optimization (2025)
- Zeroth-order Supervised Policy Improvement (2020)
- Zeroth-order Deterministic Policy Gradient (2020)
- Sample Dropout: A Simple Yet Effective Variance Reduction Technique In Deep Policy Optimization (2023)
- On-policy Policy Gradient Reinforcement Learning Without On-policy Sampling (2023)
- Low-switching Policy Gradient With Exploration Via Online Sensitivity Sampling (2023)
- Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference (2024)
- Ancestral Reinforcement Learning: Unifying Zeroth-order Optimization And Genetic Algorithms For Reinforcement Learning (2024)