GSM8K

Emerging

32 papers using it

2022 first seen

GSM8K is a benchmark of roughly 8.5K grade-school math word problems, each of which takes several steps of arithmetic reasoning to solve. It is used to evaluate the multi-step mathematical reasoning of language models.
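As a rough illustration of how a benchmark like this is typically scored, the sketch below computes exact-match accuracy on final numeric answers. It assumes reference solutions end with a `#### <number>` line, as in the commonly distributed GSM8K files, and that the model's final answer is the last number in its output. The helper names `extract_final_answer` and `exact_match_accuracy` are hypothetical and not part of any official evaluation script; individual papers may score differently.

```python
import re

# Matches integers and decimals, allowing thousands separators (e.g. "1,000").
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_answer(text):
    """Return the last number in `text` as a float, or None if there is none.

    Assumption: if a '####' marker is present, the answer follows it.
    """
    tail = text.split("####")[-1] if "####" in text else text
    numbers = NUMBER_PATTERN.findall(tail)
    if not numbers:
        return None
    return float(numbers[-1].replace(",", ""))

def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference's."""
    if not references:
        return 0.0
    correct = 0
    for prediction, reference in zip(predictions, references):
        gold = extract_final_answer(reference)
        guess = extract_final_answer(prediction)
        if gold is not None and guess == gold:
            correct += 1
    return correct / len(references)

if __name__ == "__main__":
    # Made-up strings for illustration only; not taken from the dataset.
    refs = ["16 - 3 - 4 = 9 eggs. 9 * 2 = 18. #### 18", "#### 1200"]
    preds = ["She sells 9 eggs at $2 each, so the answer is 18.", "Total: 1,000"]
    print(exact_match_accuracy(preds, refs))
```

Comparing parsed floats rather than raw strings means `3.50` and `3.5` count as the same answer; a stricter or more lenient matching rule is an equally reasonable design choice.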

Papers using GSM8K (32)

Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference · 2026

Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks · 2025

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code · 2025

Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment · 2025

MathClean: A Benchmark for Synthetic Mathematical Data Cleaning · 2025

PAL: Program-aided Language Models · 2022 · 104 cites

Progressive-Hint Prompting Improves Reasoning in Large Language Models · 2023 · 33 cites

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models · 2023 · 28 cites

Orca-Math: Unlocking the potential of SLMs in Grade School Math · 2024 · 7 cites

Learning From Mistakes Makes LLM Better Reasoner · 2023 · 5 cites

MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning · 2023 · 4 cites

DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning · 2024 · 4 cites

OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset · 2024 · 3 cites

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning · 2024 · 3 cites

Automatic Model Selection with Large Language Models for Reasoning · 2023 · 2 cites

MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs · 2024 · 2 cites

MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning · 2024 · 2 cites

LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning · 2024 · 1 cite

PMSS: Pretrained Matrices Skeleton Selection for LLM Fine-tuning · 2024 · 1 cite

TinyGSM: achieving >80% on GSM8k with small language models · 2023 · 1 cite

Building Math Agents with Multi-Turn Iterative Preference Learning · 2024 · 1 cite

Explicit Knowledge Transfer for Weakly-Supervised Code Generation · 2022

Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning · 2023

AskIt: Unified Programming Interface for Programming with Large Language Models · 2023

MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline · 2024

Prompt Selection and Augmentation for Few Examples Code Generation in Large Language Model and its Application in Robotics Control · 2024

Can LLMs Reason in the Wild with Programs? · 2024

Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning · 2024

Can Large Language Models Invent Algorithms to Improve Themselves?: Algorithm Discovery for Recursive Self-Improvement through Reinforcement Learning · 2024

ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning · 2024

UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts · 2024

InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning · 2024