Scheduling Your LLM Reinforcement Learning With Reasoning Trees
2026 · Hong Wang, Zhezheng Hao, Jian Luo, et al.
Abstract
arXiv:2510.24832v2
Using Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Large Language Models (LLMs) can be conceptualized as progressively editing a query's "Reasoning Tree". This process involves exploring nodes (tokens) and dynamically modifying the model's policy at each node. When combined with data scheduling, this process yields further gains in data efficiency and accuracy. However, existing RLVR data scheduling methods typically rely on path-based metrics to rank queries, overlooking the reasoning tree structures of these queries. In this paper, we introduce a novel metric, namely Reasoning Score (r-score), which measures the query's learning difficulty based on the structure of its reasoning tree. Based on the r-score, we propose the Reasoning Tree Schedule (Re-Schedule), a scheduling algorithm that constructs a curriculum progressing from structurally simple (high r-score) to complex (low r-score) queries. Experiments on s
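The curriculum ordering described in the abstract (high r-score first, low r-score last) can be illustrated with a minimal sketch. This is not the authors' Re-Schedule implementation: the abstract does not define how the r-score is computed, so the sketch assumes each query already carries a precomputed `r_score`. The `Query` class, the `build_curriculum` function, and the `num_stages` parameter are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Query:
    text: str
    # Assumed to be precomputed elsewhere; the abstract does not give the formula.
    r_score: float


def build_curriculum(queries: List[Query], num_stages: int) -> List[List[Query]]:
    """Sort queries from high to low r-score and split them into stages.

    Earlier stages hold structurally simple (high r-score) queries,
    later stages hold complex (low r-score) ones.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(queries, key=lambda q: q.r_score, reverse=True)
    # Distribute queries as evenly as possible; earlier stages get any remainder.
    base, extra = divmod(len(ordered), num_stages)
    stages: List[List[Query]] = []
    start = 0
    for i in range(num_stages):
        size = base + (1 if i < extra else 0)
        stages.append(ordered[start:start + size])
        start += size
    return stages
```

Splitting into equal-sized stages is an arbitrary choice made here for simplicity; a real schedule could instead weight sampling by r-score or move the stage boundaries during training.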
Related papers
- The Implicit Curriculum: Learning Dynamics In RL With Verifiable Rewards (2026)
- Adapt To Thrive! Adaptive Power-mean Policy Optimization For Improved LLM Reasoning (2026)
- Free Energy-driven Reinforcement Learning With Adaptive Advantage Shaping For Unsupervised Reasoning In LLMs (2026)
- Scaling Behaviors Of LLM Reinforcement Learning Post-training: An Empirical Study In Mathematical Reasoning (2025)
- Learnalign: Data Selection For LLM Reinforcement Learning With Improved Gradient Alignment (2026)
- A Relative-budget Theory For Reinforcement Learning With Verifiable Rewards In Large Language Model Reasoning (2026)
- SCRIBE: Structured Mid-level Supervision For Tool-using Language Models (2026)
- A Survey Of Reinforcement Learning For Large Language Models Under Data Scarcity: Challenges And Solutions (2026)