MMLU
Canonical
82 papers using it
480,483 HF downloads
769 HF likes
2024 first seen
Dataset Card for MMLU
Dataset Summary
Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The
🤗 Hugging Face · License: MIT
Papers using MMLU (82)
- HRM-Text: Efficient Pretraining Beyond Scaling
- The Price of Format: Diversity Collapse in LLMs
- A general tensor-structured compression scheme for efficient large language models
- Indic-TunedLens: Interpreting Multilingual Models in Indian Languages
- Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
- Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning
- DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
- Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories
- Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
- Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
- AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
- Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
- Residual Stream Analysis of Overfitting And Structural Disruptions
- SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing
- How do LLMs Compute Verbal Confidence
- FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning
- Leveraging KV Similarity for Online Structured Pruning in LLMs
- Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems
- Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance
- Dr.LLM: Dynamic Layer Routing in LLMs
- Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information
- EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance
- On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
- CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features
- Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs
- Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection
- How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLMs
- Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders
- GEM: Empowering LLM for both Embedding Generation and Language Understanding
- Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models
- From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
- TPTT: Transforming Pretrained Transformers into Titans
- SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition
- Control LLM: Controlled Evolution for Intelligence Retention in LLM
- Humanity's Last Exam
- LM2: Large Memory Models
- Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models
- LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
- YourBench: Easy Custom Evaluation Sets for Everyone
- AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
- Void in Language Models
- B-score: Detecting biases in large language models using response history
- Interleaved Reasoning for Large Language Models via Reinforcement Learning
- ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs
- TPTT: Transforming Pretrained Transformer into Titans
- Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
- Lizard: An Efficient Linearization Framework for Large Language Models
- CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection
- Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection
- Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance
- A Single Character can Make or Break Your LLM Evals
- The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives
- Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
- Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework
- A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets
- AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
- Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models
- Dual Decomposition of Weights and Singular Value Low Rank Adaptation
- Calibrating LLM Confidence by Probing Perturbed Representation Stability
- Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
- Efficient Data Selection at Scale via Influence Distillation
- More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
- Large Language Models Could Be Rote Learners
- None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering
- Order Independence With Finetuning
- Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
- Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
- FRAME: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy
- Forget What You Know about LLMs Evaluations -- LLMs are Like a Chameleon
- Leveraging Uncertainty Estimation for Efficient LLM Routing
- Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
- Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks
- ORI: O Routing Intelligence
- Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data
- The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?
- Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments
- LoLCATs: On Low-Rank Linearizing of Large Language Models
- BenTo: Benchmark Task Reduction with In-Context Transferability
- Adaptive Segment-level Reward: Bridging the Gap Between Action and Reward Space in Alignment
- TODO: Enhancing LLM Alignment with Ternary Preferences
- Reasoning Robustness of LLMs to Adversarial Typographical Errors
- Llama 3 Meets MoE: Efficient Upcycling