MATH-500
Emerging
53 papers using it
140,618 HF downloads
317 HF likes
2025 first seen
Dataset Card for MATH-500
This dataset contains a subset of 500 problems from the MATH benchmark that OpenAI created in their Let's Verify Step by Step paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits
Papers using MATH-500 (53)
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
- Stable Reinforcement Learning for Efficient Reasoning
- Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning
- Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model
- Kimi k1.5: Scaling Reinforcement Learning with LLMs
- QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
- How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR
- RL with Learnable Textual Feedback: A Bilevel Approach
- MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- Discrete Tilt Matching
- Think Dense, Not Long: Dynamic Decoupled Conditional Advantage for Efficient Reasoning
- Thickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reasoning
- Beyond Correctness: Learning Robust Reasoning via Transfer
- Latent Poincaré Shaping for Agentic Reinforcement Learning
- Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization
- PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization
- R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning
- TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning
- ScRPO: From Errors to Insights
- Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards
- GRPO-$\lambda$: Credit Assignment improves LLM Reasoning
- h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning
- MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model
- Reasoning with Sampling: Your Base Model is Smarter Than You Think
- SPEC-RL: Accelerating On-Policy Reinforcement Learning with Speculative Rollouts
- Enhancing Math Reasoning in Small-sized LLMs via Preview Difficulty-Aware Intervention
- wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models
- Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
- Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning
- Maximizing Confidence Alone Improves Reasoning
- Controlling Large Language Model with Latent Actions
- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
- On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
- DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
- Process Reward Models That Think
- SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
- Spurious Rewards: Rethinking Training Signals in RLVR
- $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models
- From Static to Dynamic: Adaptive Monte Carlo Search for Mathematical Process Supervision
- Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
- Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning
- Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models
- Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models
- Tool Verification for Test-Time Reinforcement Learning
- SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling
- LLM Reasoning with Process Rewards for Outcome-Guided Steps
- PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment
- Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo
- Optimal Policy Minimum Bayesian Risk