JudgeBench

Emerging

5 papers using it

First seen: 2025

JudgeBench: A Benchmark for Evaluating LLM-Based Judges

JudgeBench is a benchmark for evaluating LLM-based judges on objective correctness, using challenging response pairs. For details on how the response pairs are constructed, see the JudgeBench paper.
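As a rough illustration of what scoring a judge on response pairs can look like, the sketch below computes the fraction of pairs on which a judge picks the objectively correct response. The names `ResponsePair`, `judge_accuracy`, and `longer_is_better`, the `"A"`/`"B"` labels, and the two toy pairs are all hypothetical. They are not taken from the JudgeBench dataset or code, and the benchmark's actual fields and metrics may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ResponsePair:
    """Hypothetical record: one question, two candidate responses, one correct."""
    question: str
    response_a: str
    response_b: str
    correct: str  # "A" or "B": which response is objectively correct


# A judge takes (question, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def judge_accuracy(judge: Judge, pairs: List[ResponsePair]) -> float:
    """Fraction of pairs where the judge's pick matches the correct label."""
    if not pairs:
        return 0.0
    hits = sum(
        1
        for p in pairs
        if judge(p.question, p.response_a, p.response_b) == p.correct
    )
    return hits / len(pairs)


def longer_is_better(question: str, a: str, b: str) -> str:
    """Toy stand-in for an LLM judge: always prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


# Toy data, invented for this example only.
pairs = [
    ResponsePair("What is 2 + 2?", "4", "The answer is 5 after rounding up.", "A"),
    ResponsePair("What is the capital of France?", "Lyon", "Paris is the capital.", "B"),
]

print(judge_accuracy(longer_is_better, pairs))
```

In practice the toy judge would be replaced by a call to an LLM. The length-based judge is used here only to show that a judge driven by surface features can pick the wrong response when the correct one is short.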

Papers using JudgeBench (5)

Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks (2026)

IRPM: Intergroup Relative Preference Modeling for Pointwise Generative Reward Models (2026)

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards (2025)

Act-Adaptive Margin: Dynamically Calibrating Reward Models for Subjective Ambiguity (2025)

Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments (2025)