Qwen-2.5-math-7B

Emerging

5 papers using it

First seen: 2025

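Qwen-2.5-Math-7B is indexed here as a dataset/benchmark, though the name refers to a 7B-parameter math-specialized language model. The papers below use it to evaluate reinforcement learning with verifiable rewards (RLVR) when training large language models on reasoning tasks whose outcomes can be checked deterministically.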


Papers using Qwen-2.5-math-7B (5)

SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning (2026)

Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing (2026)

On the optimization dynamics of RLVR: Gradient gap and step size thresholds (2025)

DCPO: Dynamic Clipping Policy Optimization (2025)

Incentivizing LLMs to Self-Verify Their Answers (2025)