No Prompt Left Behind: Exploiting Zero-variance Prompts In LLM Reinforcement Learning Via Entropy-guided Advantage Shaping
2025 Β· Thanh-Long V. Le, Myeongho Jeon, Kim Vu, et al.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward -- so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce Reinforcement Learning with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extract learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over
Authors
(none)
Tags
Stats
Related papers
- Rethinking Entropy Interventions In RLVR: An Entropy Change Perspective (2026)0.00
- Prompt Replay: Speeding Up Grpo With On-policy Reuse Of High-signal Prompts (2026)0.00
- Shrinking The Variance: Shrinkage Baselines For Reinforcement Learning With Verifiable Rewards (2025)0.00
- GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning (2025)0.00
- Reinforcement Learning With Verifiable Yet Noisy Rewards Under Imperfect Verifiers (2025)0.00
- Rate Or Fate? Rlv\(^\varepsilon\)r: Reinforcement Learning With Verifiable Noisy Rewards (2026)0.00
- On The Optimization Dynamics Of RLVR: Gradient Gap And Step Size Thresholds (2025)0.00
- Adapt To Thrive! Adaptive Power-mean Policy Optimization For Improved LLM Reasoning (2026)0.00