LearnAlign: Data Selection For LLM Reinforcement Learning With Improved Gradient Alignment
2026 · Shipeng Li, Zhiqin Yang, Shikun Li, et al.
Abstract
arXiv:2506.11480v4
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this issue, we present a gradient-alignment-based method, LearnAlign, which selects learnable and representative reasoning data for RLVR post-training. To overcome the well-known response-length bias in gradient norms, we introduce a data learnability measure based on the success rate, which indicates the learning potential of each data point. Experiments across five reasoning benchmarks show that our method significantly reduces training data requirements while incurring only minor performance degradation, or even improving performance, compared to full-data training. Specifically, it reduces data requirements by up to 1,000 data points with better performance (77.5%) than that on the full dat
Related papers
- Gradalign: Gradient-aligned Data Selection For LLM Reinforcement Learning (2026)
- Adapt To Thrive! Adaptive Power-mean Policy Optimization For Improved LLM Reasoning (2026)
- Stabilizing Policy Gradients For Sample-efficient Reinforcement Learning In LLM Reasoning (2025)
- The Implicit Curriculum: Learning Dynamics In RL With Verifiable Rewards (2026)
- No Prompt Left Behind: Exploiting Zero-variance Prompts In LLM Reinforcement Learning Via Entropy-guided Advantage Shaping (2025)
- Whatever Remains Must Be True: Filtering Drives Reasoning In LLMs, Shaping Diversity (2025)
- On The Optimization Dynamics Of RLVR: Gradient Gap And Step Size Thresholds (2025)
- Value Augmented Sampling For Language Model Alignment And Personalization (2024)