GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
2026 · Ningyuan Yang, Weihua Du, Weiwei Sun, et al.
Abstract
Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpus, showing that GradAlign consistently outp
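The selection idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how alignment is measured or how problems are chosen, so the cosine-similarity score, the top-k rule, and every name below (`cosine`, `select_aligned`, the problem ids) are assumptions made for illustration. Gradients are stood in for by short plain lists rather than real policy gradients.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_aligned(problem_grads, val_grad, k):
    """Rank problems by alignment with the validation gradient and keep the top k.

    problem_grads: dict mapping a (hypothetical) problem id to its gradient vector.
    val_grad: gradient vector computed on the small trusted validation set.
    Scoring by cosine similarity and keeping a fixed top k are assumptions.
    """
    scores = {pid: cosine(g, val_grad) for pid, g in problem_grads.items()}
    ranked = sorted(scores, key=lambda pid: scores[pid], reverse=True)
    return ranked[:k], scores

# Toy data: one validation gradient and four candidate training problems.
val_grad = [1.0, 0.0]
problem_grads = {
    "p_aligned": [2.0, 0.1],     # points roughly the same way as val_grad
    "p_orthogonal": [0.0, 1.0],  # unrelated direction
    "p_opposed": [-1.0, 0.0],    # would push against the validation objective
    "p_no_signal": [0.0, 0.0],   # e.g. a problem that yields no learning signal
}

selected, scores = select_aligned(problem_grads, val_grad, k=1)
print(selected)
```

In this sketch, re-running the scoring as the policy changes would reorder the problems over time, which is one plausible reading of the "adaptive curriculum" mentioned in the abstract.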
Related papers
- Learnalign: Data Selection For LLM Reinforcement Learning With Improved Gradient Alignment (2026)
- Stabilizing Policy Gradients For Sample-efficient Reinforcement Learning In LLM Reasoning (2025)
- Value Augmented Sampling For Language Model Alignment And Personalization (2024)
- GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning (2025)
- OBLR-PO: A Theoretical Framework For Stable Reinforcement Learning (2025)
- Guiding Reinforcement Learning Using Uncertainty-aware Large Language Models (2024)
- Efficient Differentially Private Fine-tuning Of LLMs Via Reinforcement Learning (2025)
- Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization (2025)