
JudgeBench

Emerging
5 papers using it
1,587 HF downloads
11 HF likes
First seen: 2025

JudgeBench: A Benchmark for Evaluating LLM-Based Judges

📃 [Paper] • 💻 [Github] • 🤗 [Dataset] • 🏆 [Leaderboard]

JudgeBench is a benchmark for evaluating LLM-based judges on objective correctness over challenging response pairs. For more information on how the response pairs are constructed, please see our paper.
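To make "objective correctness on response pairs" concrete, here is a minimal Python sketch of how a judge could be scored on labeled pairs. It is an illustration only, not the official JudgeBench evaluation protocol: the `ResponsePair` type, its field names, the `judge_accuracy` helper, the toy judge, and the two toy examples are all hypothetical and are not taken from the dataset or the paper.

```python
from typing import Callable, List, NamedTuple


class ResponsePair(NamedTuple):
    """Hypothetical record: one question, two candidate responses, one label."""
    question: str
    response_a: str
    response_b: str
    correct: str  # "A" or "B": which response is objectively correct


# A judge maps (question, response_a, response_b) to a verdict, "A" or "B".
Judge = Callable[[str, str, str], str]


def judge_accuracy(pairs: List[ResponsePair], judge: Judge) -> float:
    """Fraction of pairs on which the judge picks the objectively correct response."""
    if not pairs:
        raise ValueError("need at least one response pair")
    hits = sum(
        1 for p in pairs
        if judge(p.question, p.response_a, p.response_b) == p.correct
    )
    return hits / len(pairs)


def longer_response_judge(question: str, a: str, b: str) -> str:
    """Toy stand-in for an LLM judge: always prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


# Invented toy data, for illustration only.
toy_pairs = [
    ResponsePair("What is 2 + 2?", "4", "The answer is 5.", "A"),
    ResponsePair("What is the capital of France?", "Lyon", "Paris", "B"),
]

accuracy = judge_accuracy(toy_pairs, longer_response_judge)
print(f"accuracy = {accuracy:.2f}")
```

In practice the toy judge would be replaced by a call to an LLM. The length-based judge is included only to show that a judge relying on a surface feature can be wrong on pairs where the correct answer is the shorter one.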

Papers using JudgeBench (5)
