
MMLU

Canonical

82 papers using it

480,483 HF downloads

769 HF likes

2024 first seen

Dataset Card for MMLU

Dataset Summary

Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The…

🤗 Hugging Face · ⚖ MIT

Papers using MMLU (82)

HRM-Text: Efficient Pretraining Beyond Scaling · 2026

The Price of Format: Diversity Collapse in LLMs · 2025 · 2 cites

A general tensor-structured compression scheme for efficient large language models · 2026

Indic-TunedLens: Interpreting Multilingual Models in Indian Languages · 2026

Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation · 2026

Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning · 2026

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models · 2026

Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories · 2026

Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models? · 2026

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs · 2026

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency · 2026

Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning · 2026

Residual Stream Analysis of Overfitting And Structural Disruptions · 2026

SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing · 2026

How do LLMs Compute Verbal Confidence · 2026

FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning · 2026

Leveraging KV Similarity for Online Structured Pruning in LLMs · 2025

Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems · 2025

Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance · 2025

Dr.LLM: Dynamic Layer Routing in LLMs · 2025

Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information · 2025

EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance · 2025

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs · 2025

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features · 2025

Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs · 2025

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection · 2025

How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLMs · 2025

Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders · 2025

GEM: Empowering LLM for both Embedding Generation and Language Understanding · 2025

Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models · 2025

From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment · 2025

TPTT: Transforming Pretrained Transformers into Titans · 2025

SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition · 2025

Control LLM: Controlled Evolution for Intelligence Retention in LLM · 2025

Humanity's Last Exam · 2025

LM2: Large Memory Models · 2025

Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models · 2025

LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers · 2025

YourBench: Easy Custom Evaluation Sets for Everyone · 2025

AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection · 2025

Void in Language Models · 2025

B-score: Detecting biases in large language models using response history · 2025

Interleaved Reasoning for Large Language Models via Reinforcement Learning · 2025

ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs · 2025

TPTT: Transforming Pretrained Transformer into Titans · 2025

Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate · 2025

Lizard: An Efficient Linearization Framework for Large Language Models · 2025

CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection · 2025

A Single Character can Make or Break Your LLM Evals · 2025

The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives · 2025

Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs · 2025

Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework · 2025

A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets · 2025

Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models · 2025

Dual Decomposition of Weights and Singular Value Low Rank Adaptation · 2025

Calibrating LLM Confidence by Probing Perturbed Representation Stability · 2025

Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst · 2025

Efficient Data Selection at Scale via Influence Distillation · 2025

More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment · 2025

Large Language Models Could Be Rote Learners · 2025

None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering · 2025

Order Independence With Finetuning · 2025

Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs · 2025

Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? · 2025

FRAME: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy · 2025

Forget What You Know about LLMs Evaluations -- LLMs are Like a Chameleon · 2025

Leveraging Uncertainty Estimation for Efficient LLM Routing · 2025

Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception · 2025

Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks · 2025

ORI: O Routing Intelligence · 2025

Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data · 2025

The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? · 2024 · 2 cites

Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments · 2024 · 2 cites

LoLCATs: On Low-Rank Linearizing of Large Language Models · 2024 · 1 cite

BenTo: Benchmark Task Reduction with In-Context Transferability · 2024

Adaptive Segment-level Reward: Bridging the Gap Between Action and Reward Space in Alignment · 2024

TODO: Enhancing LLM Alignment with Ternary Preferences · 2024

Reasoning Robustness of LLMs to Adversarial Typographical Errors · 2024

Llama 3 Meets MoE: Efficient Upcycling · 2024
