JudgeBench
Emerging • 5 papers using it
1,587 HF downloads
11 HF likes
First seen: 2025
JudgeBench: A Benchmark for Evaluating LLM-Based Judges
[Paper] • [GitHub] • [Dataset] • [Leaderboard]
JudgeBench is a benchmark for evaluating whether LLM-based judges can identify the objectively correct response in challenging response pairs. For details on how the response pairs are constructed, see the paper.
Hugging Face • MIT
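As an illustration of what "evaluating a judge for objective correctness on response pairs" can look like, here is a minimal sketch that scores a judge by how often it picks the labeled-correct response. The record fields (`question`, `response_A`, `response_B`, `label`), the label strings `"A>B"` and `"B>A"`, and the toy judge are assumptions made for this example; they are not taken from the dataset card, and the paper's actual scoring protocol may differ.

```python
from typing import Callable, Dict, Iterable

# Hypothetical record layout: field names and label strings are assumptions.
Pair = Dict[str, str]
Judge = Callable[[Pair], str]  # returns "A>B" or "B>A"


def judge_accuracy(pairs: Iterable[Pair], judge: Judge) -> float:
    """Fraction of pairs where the judge's verdict matches the gold label."""
    total = 0
    correct = 0
    for pair in pairs:
        total += 1
        if judge(pair) == pair["label"]:
            correct += 1
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return correct / total


def prefer_longer(pair: Pair) -> str:
    """Toy stand-in for an LLM judge: always prefers the longer response."""
    if len(pair["response_A"]) >= len(pair["response_B"]):
        return "A>B"
    return "B>A"


# Tiny hand-written sample, not real benchmark data.
SAMPLE_PAIRS = [
    {"question": "What is 2 + 2?",
     "response_A": "4",
     "response_B": "The answer is 5",
     "label": "A>B"},
    {"question": "What is the capital of France?",
     "response_A": "Lyon",
     "response_B": "Paris is the capital",
     "label": "B>A"},
]

if __name__ == "__main__":
    print(judge_accuracy(SAMPLE_PAIRS, prefer_longer))
```

In practice `prefer_longer` would be replaced by a function that prompts an LLM with the question and both responses and parses its verdict. The length-based judge is only there to keep the sketch self-contained, and it also shows why a superficial preference is not enough when the gold label tracks correctness.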
Papers using JudgeBench (5)
- IRPM: Intergroup Relative Preference Modeling for Pointwise Generative Reward Models
- RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
- Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments
- Act-Adaptive Margin: Dynamically Calibrating Reward Models for Subjective Ambiguity
- Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks