MATH
Canonical
43 papers using it
First seen: 2024
MATH is a benchmark dataset of mathematical problems used to evaluate how well large language models perform on mathematical reasoning tasks.
Papers using MATH (43)
- HRM-Text: Efficient Pretraining Beyond Scaling
- Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry
- Heteroskedastic Signals in Budgeted LLM Verification: Structural Heterogeneity Limits Optimization Gains
- How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws
- When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation
- Words & Weights: Streamlining Multi-Turn Interactions via Co-Adaptation
- Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning
- Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference
- Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning
- Learning to Predict Future-Aligned Research Proposals with Language Models
- Discovering Process-Outcome Credit in Multi-Step LLM Reasoning
- Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs
- Group-Aware Reinforcement Learning for Output Diversity in Large Language Models
- Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought
- How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLMs
- CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
- SIGMA: Refining Large Language Model Reasoning via Sibling-Guided Monte Carlo Augmentation
- Exchange of Perspective Prompting Enhances Reasoning in Large Language Models
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
- EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
- Reasoning to Learn from Latent Thoughts
- Entropy-Based Adaptive Weighting for Self-Training
- GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
- T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
- Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
- SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
- Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
- SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization
- Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings
- Interleaved Reasoning for Large Language Models via Reinforcement Learning
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
- Aryabhata: An exam-focused language model for JEE Math
- Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
- Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting
- Skill-Targeted Adaptive Training
- Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks
- Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
- Self-Training Elicits Concise Reasoning in Large Language Models
- DIVE: Diversified Iterative Self-Improvement
- Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks
- CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning