GSM8K
Emerging
53 papers using it
896,385 HF downloads
1,391 HF likes
2024 first seen
Dataset Card for GSM8K
Dataset Summary: GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 …
🤗 Hugging Face | ⚖ MIT
Papers using GSM8K (53)
- Stable Reinforcement Learning for Efficient Reasoning
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
- Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models
- QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
- Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain
- MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
- Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization
- Discrete Tilt Matching
- Boosting Accuracy and Efficiency of Budget Forcing in LLMs via Reinforcement Learning for Mathematical Reasoning
- Entropy-Regularized Process Reward Model
- Beyond KL Divergence: Policy Optimization With Flexible Bregman Divergences For LLM Reasoning
- ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning
- Think Dense, Not Long: Dynamic Decoupled Conditional Advantage for Efficient Reasoning
- TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT
- $n$-Musketeers: Reinforcement Learning Shapes Collaboration Among Language Models
- UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
- Why GRPO Needs Normalization: A Local-Curvature Perspective on Adaptive Gradients
- Group-Aware Reinforcement Learning for Output Diversity in Large Language Models
- ScRPO: From Errors to Insights
- h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning
- GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
- It's Not You, It's Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL
- wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models
- Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance
- RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs
- Maximizing Confidence Alone Improves Reasoning
- Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
- Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
- Training Large Language Models to Reason via EM Policy Gradient
- Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs
- Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning
- Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
- RUMAD: Reinforcement-Unifying Multi-Agent Debate
- VinePPO: Refining Credit Assignment in RL Training of LLMs
- Offline Reinforcement Learning for LLM Multi-Step Reasoning
- CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks
- SMART: Self-learning Meta-strategy Agent for Reasoning Tasks
- Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
- Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
- TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models
- Reward Model Routing in Alignment
- The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training
- Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning
- Differentiable Evolutionary Reinforcement Learning
- AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training
- SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets
- Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning
- Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
- Synthetic Data RL: Task Definition Is All You Need
- RL for Reasoning by Adaptively Revealing Rationales
- MASPRM: Multi-Agent System Process Reward Model