
MMLU-Hard

Emerging

2 papers using it

First seen: 2026

MMLU-Hard is a high-difficulty benchmark used to evaluate how well language models understand and reason through complex tasks.


Papers using MMLU-Hard (2)

The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size (2026)

The Cost Of Consensus: Isolated Self-correction Prevails Over Unguided Homogeneous Multi-agent Debate (2026)
