XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning
2025 · Daniel Palenicek, Florian Vogt, Joe Watson, et al.
Abstract
Sample efficiency is a central property of effective deep reinforcement learning algorithms. Recent work has improved this through added complexity, such as larger models, exotic network architectures, and more complex algorithms, which are typically motivated purely by empirical performance. We take a more principled approach by focusing on the optimization landscape of the critic network. Using the eigenspectrum and condition number of the critic's Hessian, we systematically investigate the impact of common architectural design decisions on training dynamics. Our analysis reveals that a novel combination of batch normalization (BN), weight normalization (WN), and a distributional cross-entropy (CE) loss produces condition numbers orders of magnitude smaller than baselines. This combination also naturally bounds gradient norms, a property critical for maintaining a stable effective learning rate under non-stationary targets and bootstrapping. Based on these insights, we introduce XQC:
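To make the role of the condition number concrete, here is a minimal sketch and not the paper's method. It assumes a hypothetical linear critic with two input features and a mean squared error loss, where the Hessian with respect to the weights is simply the second-moment matrix of the features. Per-feature standardization stands in, very loosely, for batch normalization. The data and every function name are illustrative assumptions.

```python
import math

def hessian_2d(xs):
    """Hessian of 0.5 * mean((w . x - y)^2) w.r.t. w for 2-D inputs.

    For a linear model this is mean(x x^T); the symmetric 2x2 matrix
    [[a, b], [b, d]] is returned as the tuple (a, b, d).
    """
    n = len(xs)
    a = sum(x[0] * x[0] for x in xs) / n
    b = sum(x[0] * x[1] for x in xs) / n
    d = sum(x[1] * x[1] for x in xs) / n
    return a, b, d

def condition_number(a, b, d):
    """Ratio of largest to smallest eigenvalue of [[a, b], [b, d]]."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return (mean + radius) / (mean - radius)

def standardize(xs):
    """Zero-mean, unit-variance scaling per feature (assumed stand-in for BN)."""
    n = len(xs)
    columns = []
    for j in range(2):
        col = [x[j] for x in xs]
        mu = sum(col) / n
        std = math.sqrt(sum((c - mu) ** 2 for c in col) / n)
        columns.append([(c - mu) / std for c in col])
    return list(zip(*columns))

# Hypothetical features on very different scales.
features = [(1.0, 100.0), (2.0, -100.0), (3.0, 100.0), (4.0, -100.0)]

kappa_raw = condition_number(*hessian_2d(features))
kappa_norm = condition_number(*hessian_2d(standardize(features)))

print(f"condition number, raw features:          {kappa_raw:.1f}")
print(f"condition number, standardized features: {kappa_norm:.3f}")
```

In this toy setting the raw features give a condition number in the thousands, while the standardized features give one below 3. A smaller condition number means the loss surface curves similarly in all directions, so a single learning rate suits every weight. The abstract's claim concerns deep critics with BN, WN, and a CE loss, which this sketch does not reproduce.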
Related papers
- CrossQ: Batch Normalization In Deep Reinforcement Learning For Greater Sample Efficiency And Simplicity (2019)
- Improving Offline-to-online Reinforcement Learning With Q Conditioned State Entropy Exploration (2023)
- Aggressive Q-learning With Ensembles: Achieving Both High Sample Efficiency And High Asymptotic Performance (2021)
- An Information-theoretic Optimality Principle For Deep Reinforcement Learning (2017)
- Spectral Normalisation For Deep Reinforcement Learning: An Optimisation Perspective (2021)
- Optimality-based Analysis Of XCSF Compaction In Discrete Reinforcement Learning (2020)
- Deep Q-networks For Accelerating The Training Of Deep Neural Networks (2016)
- Quantum Natural Policy Gradients: Towards Sample-efficient Reinforcement Learning (2023)