Success Conditioning As Policy Improvement: The Optimization Problem Solved By Imitating Success
2026 · Daniel Russo
Abstract
A widely used technique for improving policies is success conditioning, in which one collects trajectories, identifies those that achieve a desired outcome, and updates the policy to imitate the actions taken along successful trajectories. This principle appears under many names -- rejection sampling with SFT, goal-conditioned RL, Decision Transformers -- yet what optimization problem it solves, if any, has remained unclear. We prove that success conditioning exactly solves a trust-region optimization problem, maximizing policy improvement subject to a \(\chi^2\) divergence constraint whose radius is determined automatically by the data. This yields an identity: relative policy improvement, the magnitude of policy change, and a quantity we call action-influence -- measuring how random variation in action choices affects success rates -- are exactly equal at every state. Success conditioning thus emerges as a conservative improvement operator. Exact success conditioning cannot degrade p
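To make the claimed identity concrete, the following is a minimal sketch in a single-state (bandit) setting, which is a simplifying assumption and not the paper's general formulation. Here success conditioning reweights a behavior policy by each action's success probability, and the script checks numerically that relative improvement equals the \(\chi^2\) divergence between the new and old policies. The function names, the example numbers, and the use of the normalized variance of success probability as a stand-in for action-influence are all illustrative assumptions; the paper's exact definitions are not given in this excerpt.

```python
# Hypothetical single-state illustration; names and numbers are assumptions.

def success_rate(policy, success_prob):
    """Overall success probability when actions are drawn from `policy`."""
    return sum(p * q for p, q in zip(policy, success_prob))


def success_condition(policy, success_prob):
    """Exact success conditioning: pi_plus(a) = pi(a) * q(a) / V (Bayes' rule)."""
    value = success_rate(policy, success_prob)
    if value <= 0:
        raise ValueError("behavior policy never succeeds; conditioning is undefined")
    return [p * q / value for p, q in zip(policy, success_prob)]


def chi_square(new, old):
    """Chi-square divergence of `new` from `old`: sum over a of (new - old)^2 / old."""
    return sum((n - o) ** 2 / o for n, o in zip(new, old) if o > 0)


def action_influence(policy, success_prob):
    """Assumed stand-in: variance of q(a) under the policy, divided by V^2."""
    value = success_rate(policy, success_prob)
    variance = sum(p * (q - value) ** 2 for p, q in zip(policy, success_prob))
    return variance / value ** 2


# Illustrative behavior policy and per-action success probabilities.
policy = [0.5, 0.3, 0.2]
success_prob = [0.2, 0.5, 0.9]

pi_plus = success_condition(policy, success_prob)
v_old = success_rate(policy, success_prob)
v_new = success_rate(pi_plus, success_prob)

relative_improvement = (v_new - v_old) / v_old
divergence = chi_square(pi_plus, policy)
influence = action_influence(policy, success_prob)

print(relative_improvement, divergence, influence)
```

In this one-step case the three printed quantities coincide because each reduces to the variance of the success probability under the behavior policy divided by the squared success rate. The variance is nonnegative, so the conditioned policy cannot do worse than the original in this setting.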
Related papers
- Reward-conditioned Policies (2019)
- Simple Policy Optimization (2024)
- Imitating Past Successes Can Be Very Suboptimal (2022)
- Absolute Policy Optimization (2023)
- Conservative Optimistic Policy Optimization Via Multiple Importance Sampling (2021)
- Post-convergence Sim-to-real Policy Transfer: A Principled Alternative To Cherry-picking (2025)
- Near-future Policy Optimization (2026)
- An Analytical Update Rule For General Policy Optimization (2021)