Policy Optimization With Second-order Advantage Information
2018 · Jiajin Li, Baoxiang Wang
Abstract
Policy optimization on high-dimensional continuous control tasks is difficult because policy gradient estimators have large variance. We present the action subspace dependent gradient (ASDG) estimator, which combines the Rao-Blackwell theorem (RB) and control variates (CV) in a unified framework to reduce this variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space from second-order advantage information. POSA captures this quadratic information explicitly and efficiently by using the wide & deep architecture. Empirical studies show that our approach improves performance on high-dimensional synthetic settings and on OpenAI Gym's MuJoCo continuous control tasks.
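The variance argument in the abstract can be illustrated with a toy sketch. This is not the POSA algorithm or the ASDG estimator itself: it assumes a unit-variance Gaussian policy and a hypothetical advantage that is fully separable across action dimensions, so the factorization is known rather than learned. All names (`gradient_samples`, `target`, `adv_per_dim`) are made up for the illustration. Under these assumptions, each gradient component can drop the advantage terms that do not depend on its own action dimension without introducing bias, and the per-sample variance falls sharply.

```python
import numpy as np

def gradient_samples(d=20, n=20000, seed=0):
    """Per-sample score-function gradients w.r.t. the mean of a N(mu, I) policy.

    Toy assumption: the advantage is A(a) = sum_i -(a_i - target_i)^2,
    i.e. it factorizes completely over action dimensions.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(d)       # policy mean (hypothetical toy value)
    target = np.ones(d)    # maximizer of the toy advantage
    actions = mu + rng.standard_normal((n, d))

    score = actions - mu                    # grad_mu log N(a; mu, I)
    adv_per_dim = -(actions - target) ** 2  # A_i(a_i), shape (n, d)
    adv_total = adv_per_dim.sum(axis=1, keepdims=True)

    # Plain estimator: every component is weighted by the full advantage.
    plain = score * adv_total
    # Factored estimator: component i keeps only the term that depends on a_i.
    # The dropped terms are independent of a_i, so their expected
    # contribution to component i is zero and the estimator stays unbiased.
    factored = score * adv_per_dim
    return plain, factored

plain, factored = gradient_samples()
plain_var = plain.var(axis=0).mean()
factored_var = factored.var(axis=0).mean()

print(f"plain variance per dimension:    {plain_var:.1f}")
print(f"factored variance per dimension: {factored_var:.1f}")
print(f"factored mean gradient:          {factored.mean():.2f} (exact value is 2)")
```

In this toy setting the gap grows with the action dimension `d`, because the plain estimator multiplies each score component by a sum of `d` advantage terms. The paper's setting is harder: the factorization is not given and has to be inferred from second-order advantage information.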
Related papers
- Identifying Policy Gradient Subspaces (2024)
- Marginal Policy Gradients: A Unified Family Of Estimators For Bounded Action Spaces With Applications (2018)
- Distributional Policy Optimization: An Alternative Approach For Continuous Control (2019)
- Proximal Policy Optimization With Continuous Bounded Action Space Via The Beta Distribution (2021)
- ANO: A Principled Approach To Robust Policy Optimization (2026)
- Action-dependent Control Variates For Policy Optimization Via Stein's Identity (2017)
- Off-oab: Off-policy Policy Gradient Method With Optimal Action-dependent Baseline (2024)
- Mitigating Suboptimality Of Deterministic Policy Gradients In Complex Q-functions (2024)