JudgeBench
Emerging · 4 papers using it
1,587 HF downloads
11 HF likes
2025 first seen
JudgeBench: A Benchmark for Evaluating LLM-Based Judges
[Paper] • [GitHub] • [Dataset] • [Leaderboard]
JudgeBench is a benchmark for evaluating LLM-based judges on objective correctness over challenging response pairs. For more information on how the response pairs are constructed, please see our paper.
Hugging Face • mit
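To make the idea of judging "objective correctness on response pairs" concrete, here is a minimal sketch of how a judge could be scored against labeled pairs. It is not JudgeBench's official evaluation protocol or data schema: the `ResponsePair` fields, the `"A"`/`"B"` label convention, the `judge_accuracy` helper, and the toy `longer_response_judge` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ResponsePair:
    """Hypothetical record: one question with two candidate responses."""
    question: str
    response_a: str
    response_b: str
    label: str  # "A" or "B": which response is objectively correct (assumed convention)


# A judge takes (question, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def judge_accuracy(pairs: Iterable[ResponsePair], judge: Judge) -> float:
    """Fraction of pairs where the judge picks the objectively correct response."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one response pair")
    correct = sum(
        1
        for p in pairs
        if judge(p.question, p.response_a, p.response_b) == p.label
    )
    return correct / len(pairs)


def longer_response_judge(question: str, response_a: str, response_b: str) -> str:
    """Toy baseline that prefers the longer response (stands in for an LLM call)."""
    return "A" if len(response_a) >= len(response_b) else "B"


# Made-up example pairs, not taken from the benchmark.
pairs = [
    ResponsePair("What is 2 + 2?", "4", "The answer is 5.", "A"),
    ResponsePair("Is 7 prime?", "No.", "Yes, 7 is prime.", "B"),
]

accuracy = judge_accuracy(pairs, longer_response_judge)
print(f"accuracy = {accuracy:.2f}")
```

The length-based judge gets one of the two toy pairs wrong, which illustrates why a benchmark built on objective correctness can separate judges that track correctness from judges that track surface features such as verbosity.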
Papers using JudgeBench (4)
- Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge
- HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
- HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
- Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments