Arena-Hard

Emerging
24 papers using it
66 HF downloads
1 HF like
First seen: 2024

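Arena-Hard is a benchmark of challenging user prompts sourced from Chatbot Arena. Model responses are scored by an LLM judge in pairwise comparison against a baseline model's responses. Because scoring depends on LLM judges, it is also used to study whether a model's outputs can deceive those judges.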

Papers using Arena-Hard (24)
