MMLU-Pro

Canonical

29 papers using it

168,573 HF downloads

485 HF likes

2025 first seen

MMLU-Pro is a more robust and challenging massive multi-task understanding dataset designed to benchmark large language models' capabilities more rigorously. It contains 12K complex questions across various disciplines.

🤗 Hugging Face ⚖ MIT

Papers using MMLU-Pro (29)

K-Quantization and its Impact on Output Performance (2026)

NITP: Next Implicit Token Prediction for LLM Pre-training (2026)

D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models (2026)

Benchmark Illusion: Disagreement Among LLMs and Its Scientific Consequences (2026)

AWPO: Enhancing Tool-Use of Large Language Models through Adaptive Integration of Reasoning Rewards (2025)

Group-Aware Reinforcement Learning for Output Diversity in Large Language Models (2025)

Let the Model Distribute Its Doubt: Confidence Estimation through Verbalized Probability Distribution (2025)

AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment (2025)

Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty? (2025)

Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD (2025)

Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models (2025)

OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning (2025)

When an LLM is apprehensive about its answers -- and when its uncertainty is justified (2025)

T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models (2025)

AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection (2025)

Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings (2025)

General-Reasoner: Advancing LLM Reasoning Across All Domains (2025)

Reinforcing General Reasoning without Verifiers (2025)

DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors (2025)

Answer Matching Outperforms Multiple Choice for Language Model Evaluation (2025)

From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation (2025)

The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives (2025)

Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning (2025)

ReviewInstruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models (2025)

INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling (2025)

Calibrating LLM Confidence by Probing Perturbed Representation Stability (2025)

Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? (2025)
