Optimize Wider, Not Deeper: Consensus Aggregation For Policy Optimization
2026 · Zelal Su, Mustafaoglu, Sungyoung Lee, et al.
Abstract
Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we can use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: \(K\) PPO replicates are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural parameter space of the policy distribution via the logarithmic opinion pool. In natural parameter
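The two aggregation spaces named in the abstract can be illustrated with a toy case. The sketch below is not the authors' implementation: it assumes each of the \(K\) replicates outputs a 1-D Gaussian action distribution at a single state, uses equal weights, and the function names `euclidean_average` and `log_opinion_pool` are hypothetical. It contrasts averaging the (mean, std) parameters directly with the logarithmic opinion pool, which for Gaussians amounts to averaging the natural parameters \(\mu/\sigma^2\) and \(-1/(2\sigma^2)\).

```python
import math


def euclidean_average(replicas):
    """Average (mean, std) pairs directly in Euclidean parameter space."""
    k = len(replicas)
    mean = sum(m for m, _ in replicas) / k
    std = sum(s for _, s in replicas) / k
    return mean, std


def log_opinion_pool(replicas):
    """Equal-weight logarithmic opinion pool of 1-D Gaussians.

    The pooled density is proportional to prod_k N(x; m_k, s_k^2)^(1/K).
    That product is again Gaussian, with precision equal to the average
    precision and mean equal to the precision-weighted average mean.
    """
    k = len(replicas)
    precision = sum(1.0 / s ** 2 for _, s in replicas) / k  # average of 1/s^2
    eta1 = sum(m / s ** 2 for m, s in replicas) / k         # average of m/s^2
    return eta1 / precision, math.sqrt(1.0 / precision)


# Two hypothetical replicates that disagree in both mean and confidence.
replicas = [(0.0, 1.0), (1.0, 0.5)]

avg_mean, avg_std = euclidean_average(replicas)
pool_mean, pool_std = log_opinion_pool(replicas)

print("Euclidean average:   ", avg_mean, avg_std)
print("Log opinion pool:    ", pool_mean, pool_std)
```

With these inputs the Euclidean average is (0.5, 0.75), while the pool gives a mean of 0.8 and a standard deviation of about 0.632: the consensus leans toward the more confident replicate and is narrower than the plain average. For a neural-network policy the Euclidean variant would average the network weights instead, which this toy case does not model.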
Related papers
- Simple Policy Optimization (2024) · 0.00
- Truly Proximal Policy Optimization (2019) · 0.00
- Policy Optimization With Penalized Point Probability Distance: An Alternative To Proximal Policy Optimization (2018) · 0.00
- Proximal Policy Optimization Algorithms (2017) · 0.00
- Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy (2019) · 0.00
- KIPPO: Koopman-Inspired Proximal Policy Optimization (2025) · 0.00
- CIM-PPO: Proximal Policy Optimization With Liu-Correntropy Induced Metric (2021) · 0.00
- Proximal Policy Optimization Via Enhanced Exploration Efficiency (2020) · 13.70