Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity
2025 · Germán Kruszewski, Pierre Erbacher, Jos Rozen, et al.
Abstract
Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in this way often suffer a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" reverse KL divergence to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the \(\alpha\)-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art per
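The filtering step and the interpolation between divergences described in the abstract can be illustrated on a toy discrete answer set. The sketch below is not the authors' implementation. It assumes Amari's parameterization of the \(\alpha\)-divergence, \(D_\alpha(p \,\|\, q) = \frac{1 - \sum_y p(y)^\alpha q(y)^{1-\alpha}}{\alpha(1-\alpha)}\), which may differ from the one used in the paper, and the function names `filtered_target`, `kl`, and `alpha_divergence` are hypothetical. Under this assumed convention, with `p` as the filtered target and `q` as the model, \(\alpha \to 1\) recovers the mass-covering forward KL and \(\alpha \to 0\) recovers the mode-seeking reverse KL.

```python
import math


def filtered_target(base_probs, is_correct):
    """Zero out incorrect answers and renormalize.

    Correct answers keep their relative probabilities under the base model.
    """
    kept = [p if ok else 0.0 for p, ok in zip(base_probs, is_correct)]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("no correct answer has non-zero base probability")
    return [p / total for p in kept]


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence for 0 < alpha < 1 (assumed parameterization)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    overlap = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (1.0 - overlap) / (alpha * (1.0 - alpha))


# Toy example: four candidate answers, the second one is incorrect.
base = [0.5, 0.3, 0.15, 0.05]            # hypothetical base-model probabilities
correct = [True, False, True, True]      # hypothetical verifier output
target = filtered_target(base, correct)

model = [0.6, 0.1, 0.2, 0.1]             # hypothetical current policy
for a in (0.1, 0.5, 0.9):
    print(f"alpha={a}: D_alpha(target || model) = {alpha_divergence(target, model, a):.4f}")
```

Because the filtered target assigns zero probability to incorrect answers, the reverse KL from any model that still places mass on them is infinite, while the \(\alpha\)-divergence for \(0 < \alpha < 1\) stays finite; this is one reason an intermediate \(\alpha\) is convenient in a toy setting like this one.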
Related papers
- Uniform-correct Policy Optimization: Breaking RLVR's Indifference To Diversity (2026)
- Power Distribution Bridges Sampling, Self-reward RL, And Self-distillation (2026)
- No Prompt Left Behind: Exploiting Zero-variance Prompts In LLM Reinforcement Learning Via Entropy-guided Advantage Shaping (2025)
- Free Energy-driven Reinforcement Learning With Adaptive Advantage Shaping For Unsupervised Reasoning In LLMs (2026)
- A Unified Framework For Rethinking Policy Divergence Measures In GRPO (2026)
- Adapt To Thrive! Adaptive Power-mean Policy Optimization For Improved LLM Reasoning (2026)
- Reducing Belief Deviation In Reinforcement Learning For Active Reasoning (2025)
- Learnalign: Data Selection For LLM Reinforcement Learning With Improved Gradient Alignment (2026)