GSM8K
Canonical · 104 papers using it
First seen: 2024
GSM8K is a benchmark of grade-school math word problems, used to evaluate how well large language models understand and solve multi-step mathematical tasks.
Papers using GSM8K (104)
- HRM-Text: Efficient Pretraining Beyond Scaling
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
- Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry
- On the Limits of Layer Pruning for Generative Reasoning in Large Language Models
- Distributional Energy-Based Models for Uncertainty-Aware Structured LLM Reasoning
- REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
- How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws
- The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic
- Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
- The Price of Format: Diversity Collapse in LLMs
- When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation
- Exploring LLM Reasoning Through Controlled Prompt Variations
- Strategic Over-Parameterization for Generalizable Low-Rank Adaptation
- Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning
- Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers
- Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA
- EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs
- Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
- Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting
- Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference
- Does Inference Scaling Improve Reasoning Faithfulness? A Multi-Model Analysis of Self-Consistency Tradeoffs
- The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
- Training Language Models via Neural Cellular Automata
- Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning
- Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
- Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities
- AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
- Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
- Residual Stream Analysis of Overfitting And Structural Disruptions
- SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing
- Efficient Epistemic Uncertainty Estimation for Large Language Models via Knowledge Distillation
- Guided Collaboration in Heterogeneous LLM-Based Multi-Agent Systems via Entropy-Based Understanding Assessment and Experience Retrieval
- ROAST: Rollout-based On-distribution Activation Steering Technique
- FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning
- Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning
- Group-Aware Reinforcement Learning for Output Diversity in Large Language Models
- Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems
- Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance
- Prototype-Based Dynamic Steering for Large Language Models
- Dr.LLM: Dynamic Layer Routing in LLMs
- A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space
- Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning
- From Implicit Exploration to Structured Reasoning: Leveraging Guideline and Refinement for LLMs
- LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures
- Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs
- CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks
- P3: Prompts Promote Prompting
- Text-to-LoRA: Instant Transformer Adaption
- Enhancing Reasoning Capabilities of Small Language Models with Blueprints and Prompt Template Search
- Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty
- Learning to Insert [PAUSE] Tokens for Better Reasoning
- ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
- Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges
- SIFT: Grounding LLM Reasoning in Contexts via Stickers
- FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
- Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?
- Entropy-Based Adaptive Weighting for Self-Training
- Sample, Don't Search: Rethinking Test-Time Alignment for Language Models
- Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution
- AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
- Thinkless: LLM Learns When to Think
- EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning
- LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
- Learning a Continue-Thinking Token for Enhanced Test-Time Scaling
- Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute
- CommVQ: Commutative Vector Quantization for KV Cache Compression
- Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
- Aryabhata: An exam-focused language model for JEE Math
- Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models
- Socratic-Zero: Bootstrapping Reasoning via Data-Free Agent Co-evolution
- dParallel: Learnable Parallel Decoding for dLLMs
- Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance
- Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
- QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
- OpenSIR: Open-Ended Self-Improving Reasoner
- LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
- Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks
- AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
- Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
- Dual Decomposition of Weights and Singular Value Low Rank Adaptation
- Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
- Efficient Data Selection at Scale via Influence Distillation
- Adaptive Rectification Sampling for Test-Time Compute Scaling
- Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models
- Rule-Guided Feedback: Enhancing Reasoning by Enforcing Rule Adherence in Large Language Models
- Leveraging Uncertainty Estimation for Efficient LLM Routing
- TokenSkip: Controllable Chain-of-Thought Compression in LLMs
- Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?
- Self-Training Elicits Concise Reasoning in Large Language Models
- DIVE: Diversified Iterative Self-Improvement
- Evolutionary Pre-Prompt Optimization for Mathematical Reasoning
- Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models
- Understanding Chain-of-Thought in LLMs through Information Theory
- Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation
- Adaptive Segment-level Reward: Bridging the Gap Between Action and Reward Space in Alignment
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
- Reasoning Robustness of LLMs to Adversarial Typographical Errors
- Learning to Reason via Self-Iterative Process Feedback for Small Language Models
- GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers
- Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions
- LLM2: Let Large Language Models Harness System 2 Reasoning
- CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning