MMLU-Pro
Canonical
29 papers using it
168,573 HF downloads
485 HF likes
2025 first seen
MMLU-Pro Dataset
MMLU-Pro is a more robust and challenging massive multi-task understanding dataset, designed to benchmark the capabilities of large language models more rigorously. It contains 12K complex questions across various disciplines.
🤗 Hugging Face
⚖ MIT
Papers using MMLU-Pro (29)
- K-Quantization and its Impact on Output Performance
- NITP: Next Implicit Token Prediction for LLM Pre-training
- D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
- Benchmark Illusion: Disagreement Among Llms And Its Scientific Consequences
- AWPO: Enhancing Tool-Use of Large Language Models through Adaptive Integration of Reasoning Rewards
- Group-Aware Reinforcement Learning for Output Diversity in Large Language Models
- Let the Model Distribute Its Doubt: Confidence Estimation through Verbalized Probability Distribution
- AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment
- Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?
- Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
- Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models
- OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
- When an LLM is apprehensive about its answers -- and when its uncertainty is justified
- T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
- AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
- Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings
- General-Reasoner: Advancing LLM Reasoning Across All Domains
- Reinforcing General Reasoning without Verifiers
- DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
- Answer Matching Outperforms Multiple Choice for Language Model Evaluation
- From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation
- Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
- The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives
- Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
- AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
- ReviewInstruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models
- INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling
- Calibrating LLM Confidence by Probing Perturbed Representation Stability
- Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?