Zeroth-order Optimization Is Secretly Single-step Policy Optimization
2025 · Junbin Qiu, Zhengpeng Xie, Xiangda Yan, et al.
Abstract
Zeroth-Order Optimization (ZOO) provides powerful tools for optimizing functions where explicit gradients are unavailable or expensive to compute. However, the underlying mechanisms of popular ZOO methods, particularly those employing randomized finite differences, and their connection to other optimization paradigms such as Reinforcement Learning (RL) have not been fully elucidated. This paper establishes a fundamental and previously unrecognized connection: ZOO with finite differences is equivalent to a specific instance of single-step Policy Optimization (PO). We formally show that the implicitly smoothed objective function optimized by common ZOO algorithms is identical to a single-step PO objective. Furthermore, we show that widely used ZOO gradient estimators are mathematically equivalent to the REINFORCE gradient estimator with a specific baseline function, revealing the variance-reducing mechanism in ZOO from a PO perspective. Built on this unified framework, we propose ZoAR (Zeroth-O
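The claimed equivalence between a randomized finite-difference estimator and REINFORCE with a baseline can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes Gaussian smoothing with radius `mu`, a one-sided finite difference, a Gaussian "policy" N(theta, mu^2 I) over query points, and the baseline f(theta). The function names and the quadratic test objective are hypothetical.

```python
import numpy as np

def zoo_estimate(f, theta, u, mu):
    """One-sided randomized finite-difference estimate along direction u."""
    return (f(theta + mu * u) - f(theta)) / mu * u

def reinforce_estimate(f, theta, x, mu, baseline):
    """REINFORCE estimate for an assumed Gaussian policy N(theta, mu^2 I), sampled at x."""
    # Score function: gradient of log N(x; theta, mu^2 I) with respect to theta.
    score = (x - theta) / mu**2
    return (f(x) - baseline) * score

def quadratic(x):
    """Hypothetical test objective, used only for this illustration."""
    return float(np.sum(x**2))

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
mu = 0.1

u = rng.standard_normal(theta.shape)  # random perturbation direction
x = theta + mu * u                    # the same point, viewed as a policy sample

g_zoo = zoo_estimate(quadratic, theta, u, mu)
g_pg = reinforce_estimate(quadratic, theta, x, mu, baseline=quadratic(theta))

# Both estimators are the same vector up to floating-point rounding.
print(np.allclose(g_zoo, g_pg))
```

Under these assumptions the two expressions coincide sample by sample, because the score (x - theta) / mu^2 equals u / mu. The subtracted term f(theta) therefore plays the role of a REINFORCE baseline: it changes the variance of the estimate without changing its expectation.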
Related papers
- Learning Sampling Policy For Faster Derivative Free Optimization (2021)
- Zeroth-order Supervised Policy Improvement (2020)
- Ancestral Reinforcement Learning: Unifying Zeroth-order Optimization And Genetic Algorithms For Reinforcement Learning (2024)
- Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference (2024)
- Simple Policy Optimization (2024)
- ISOPO: Proximal Policy Gradients Without Pi-old (2025)
- Zeroth-order Deterministic Policy Gradient (2020)
- Proximal Policy Optimization Algorithms (2017)