MATH

Canonical

43 papers using it

First seen in 2024

MATH is a benchmark of mathematical problems used to evaluate how well large language models perform mathematical reasoning.


Papers using MATH (43)

HRM-Text: Efficient Pretraining Beyond Scaling (2026)

Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry (2026)

Heteroskedastic Signals in Budgeted LLM Verification: Structural Heterogeneity Limits Optimization Gains (2026)

How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws (2026)

When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation (2026)

Words & Weights: Streamlining Multi-Turn Interactions via Co-Adaptation (2026)

Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning (2026)

Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference (2026)

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning (2026)

Learning to Predict Future-Aligned Research Proposals with Language Models (2026)

Discovering Process-Outcome Credit in Multi-Step LLM Reasoning (2026)

Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs (2026)

Group-Aware Reinforcement Learning for Output Diversity in Large Language Models (2025)

Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought (2025)

How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLMs (2025)

CLEAR: Error Analysis via LLM-as-a-Judge Made Easy (2025)

SIGMA: Refining Large Language Model Reasoning via Sibling-Guided Monte Carlo Augmentation (2025)

Exchange of Perspective Prompting Enhances Reasoning in Large Language Models (2025)

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (2025)

ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (2025)

EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees (2025)

Reasoning to Learn from Latent Thoughts (2025)

Entropy-Based Adaptive Weighting for Self-Training (2025)

GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning (2025)

T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models (2025)

Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution (2025)

M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models (2025)

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning (2025)

Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers (2025)

SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization (2025)

Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings (2025)

Interleaved Reasoning for Large Language Models via Reinforcement Learning (2025)

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning (2025)

Aryabhata: An exam-focused language model for JEE Math (2025)

Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts (2025)

Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting (2025)

Skill-Targeted Adaptive Training (2025)

Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks (2025)

Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? (2025)

Self-Training Elicits Concise Reasoning in Large Language Models (2025)

DIVE: Diversified Iterative Self-Improvement (2025)

Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks (2024)

CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning (2024)