DISPO: Enhancing Training Efficiency And Stability In Reinforcement Learning For Large Language Model Mathematical Reasoning
2026 · Batuhan K. Karaman, Aditya Rawal, Suhaila Shakiah, et al.
Abstract
Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models, particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability, as they clip importance sampling weights while still permitting non-zero gradients outside the trust region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weigh
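The decoupled clipping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the clip bounds, and the per-token averaging are all hypothetical, and the exact DISPO objective is not given in the text above. The sketch only shows how separate lower and upper bounds for correct and incorrect responses give four independently tunable values.

```python
import math

def decoupled_clipped_weights(logp_new, logp_old, is_correct,
                              clip_correct=(0.8, 1.2),
                              clip_incorrect=(0.5, 1.5)):
    """Clip per-token importance sampling weights with response-dependent bounds.

    The bounds are illustrative placeholders, not values from the paper.
    Each tuple is (down-clip bound, up-clip bound), so the two tuples
    together expose four separately controllable settings.
    """
    lo, hi = clip_correct if is_correct else clip_incorrect
    weights = []
    for new, old in zip(logp_new, logp_old):
        ratio = math.exp(new - old)              # pi_new(token) / pi_old(token)
        weights.append(min(max(ratio, lo), hi))  # clamp into [lo, hi]
    return weights

def reinforce_style_loss(logp_new, logp_old, advantage, is_correct, **clip_kwargs):
    """REINFORCE-style surrogate: -(clipped weight) * advantage * log-prob.

    Assumption: in an autodiff framework the clipped weight would be treated
    as a constant (stop-gradient), so clipped tokens still contribute a
    non-zero gradient through logp_new.
    """
    weights = decoupled_clipped_weights(logp_new, logp_old, is_correct, **clip_kwargs)
    total = sum(w * advantage * lp for w, lp in zip(weights, logp_new))
    return -total / len(logp_new)
```

With these placeholder bounds, a token whose probability ratio is 2 receives weight 1.2 in a correct response and 1.5 in an incorrect one, while a ratio of 0.25 is raised to 0.8 and 0.5 respectively.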
Related papers
- GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning (2025)
- It's Not You, It's Clipping: A Soft Trust-region Via Probability Smoothing For LLM RL (2025)
- Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization (2025)
- What's Behind PPO's Collapse In Long-CoT? Value Optimization Holds The Secret (2025)
- ReMax: A Simple, Effective, And Efficient Reinforcement Learning Method For Aligning Large Language Models (2023)
- Stabilizing Reinforcement Learning For Diffusion Language Models (2026)
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)
- Quantile Reward Policy Optimization: Alignment With Pointwise Regression And Exact Partition Functions (2025)