Expected Return Causes Outcome-level Mode Collapse In Reinforcement Learning And How To Fix It With Inverse Probability Scaling
2026 · Abhijeet Sinha, Sundari Elango, Dianbo Liu
Abstract
Many reinforcement learning (RL) problems admit multiple terminal solutions of comparable quality, where the goal is not to identify a single optimum but to represent a diverse set of high-quality outcomes. Nevertheless, policies trained by standard expected-return maximization routinely collapse onto a small subset of outcomes, a phenomenon commonly attributed to insufficient exploration or weak regularization. We show that this explanation is incomplete: outcome-level mode collapse is a structural consequence of the expected-return objective itself. Under idealized learning dynamics, the log-probability ratio between any two outcomes evolves linearly in their reward difference, implying exponential divergence of the probability ratio and inevitable collapse, independent of the exploration strategy, entropy regularization, or optimization algorithm. We identify the source of this pathology as the probability multiplier inside the expectation and propose a minimal correction: inverse probability scaling,
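The linear growth of the log-probability ratio can be illustrated with a small numerical sketch. It assumes that the "idealized learning dynamics" are replicator-style, d/dt log p_i = r_i - E_p[r], which is an assumption made here for illustration and is not taken from the paper. The function name `simulate_log_probs`, the step size, and the reward values are likewise hypothetical. The sketch covers only the collapse claim, not the proposed correction.

```python
import numpy as np

def simulate_log_probs(rewards, p0, dt=0.01, steps=2000):
    """Euler-integrate d/dt log p_i = r_i - E_p[r] over a finite outcome set.

    This replicator-style update is an assumed stand-in for idealized
    expected-return dynamics; it is not the authors' algorithm.
    Returns an array of shape (steps + 1, n_outcomes) of log-probabilities.
    """
    r = np.asarray(rewards, dtype=float)
    logp = np.log(np.asarray(p0, dtype=float))
    history = [logp.copy()]
    for _ in range(steps):
        p = np.exp(logp)
        # Each outcome's log-probability moves by its reward advantage.
        logp = logp + dt * (r - p @ r)
        # Renormalize; this shifts all entries equally, so ratios are unchanged.
        logp = logp - np.log(np.exp(logp).sum())
        history.append(logp.copy())
    return np.array(history)

if __name__ == "__main__":
    # Three outcomes; the first two are of comparable quality (assumed values).
    rewards = [1.0, 0.9, 0.5]
    hist = simulate_log_probs(rewards, p0=[1 / 3, 1 / 3, 1 / 3])
    # The log ratio between outcomes 0 and 1 grows as (r_0 - r_1) * t.
    print("log(p0/p1) at end:", hist[-1, 0] - hist[-1, 1])
    print("final probabilities:", np.exp(hist[-1]))
```

Under this assumed update the gap log(p_i / p_j) grows by exactly dt * (r_i - r_j) per step, so even a small reward difference between near-equivalent outcomes compounds into an exponentially large probability ratio, whatever the initial distribution.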
Related papers
- Beyond Expected Return: Accounting For Policy Reproducibility When Evaluating Reinforcement Learning Algorithms (2023) 3.58
- Moments Matter: Stabilizing Policy Optimization Using Return Distributions (2026) 0.00
- Model-agnostic Solutions For Deep Reinforcement Learning In Non-ergodic Contexts (2026) 0.00
- The Perils Of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret (2024) 0.00
- Analyzing And Bridging The Gap Between Maximizing Total Reward And Discounted Reward In Deep Reinforcement Learning (2024) 0.00
- Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck Of Reinforcement Learning (2025) 0.00
- Imitating Past Successes Can Be Very Suboptimal (2022) 0.00
- Dense And Diverse Goal Coverage In Multi Goal Reinforcement Learning (2025) 0.00