RewardBench (v1) Leaderboard
RewardBench v1: the original reward-model benchmark, scoring how well a model picks the preferred response across the Chat, Chat Hard, Safety, and Reasoning categories. Score is the overall weighted accuracy. · Metric: Score (higher is better)
| # | Model | Score | Paper |
|---|---|---|---|
| 1 | infly/INF-ORM-Llama3.1-70B | 95.11 | link |
| 2 | ShikaiChen/LDL-Reward-Gemma-2-27B-v0.1 | 94.99 | link |
| 3 | nicolinho/QRM-Gemma-2-27B | 94.44 | link |
| 4 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | 94.26 | link |
| 5 | nvidia/Llama-3.1-Nemotron-70B-Reward | 94.11 | link |
| 6 | Skywork/Skywork-Reward-Gemma-2-27B | 93.80 | link |
| 7 | SF-Foundation/TextEval-Llama3.1-70B | 93.48 | link |
| 8 | meta-metrics/MetaMetrics-RM-v1.0 | 93.42 | link |
| 9 | Skywork/Skywork-Critic-Llama-3.1-70B | 93.31 | link |
| 10 | nicolinho/QRM-Llama3.1-8B-v2 | 93.14 | link |
| 11 | Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | 93.13 | link |
| 12 | nicolinho/QRM-Llama3.1-8B | 93.06 | link |
| 13 | LxzGordon/URM-LLaMa-3.1-8B | 92.94 | link |
| 14 | Salesforce/SFR-LLaMa-3.1-70B-Judge-r | 92.72 | link |
| 15 | R-I-S-E/RISE-Judge-Qwen2.5-32B | 92.66 | link |
| 16 | Skywork/Skywork-Reward-Llama-3.1-8B | 92.52 | link |
| 17 | AtlaAI/Selene-1 | 92.41 | link |
| 18 | general-preference/GPM-Llama-3.1-8B | 92.24 | link |
| 19 | nvidia/Nemotron-4-340B-Reward | 92.00 | link |
| 20 | Ray2333/GRM-Llama3-8B-rewardmodel-ft | 91.54 | link |
| 21 | nicolinho/QRM-Llama3-8B | 91.10 | link |
| 22 | SF-Foundation/TextEval-OffsetBias-12B | 91.05 | link |
| 23 | Ray2333/GRM-llama3.2-3B-rewardmodel-ft | 90.92 | link |
| 24 | Salesforce/SFR-nemo-12B-Judge-r | 90.27 | link |
| 25 | allenai/Llama-3.1-70B-Instruct-RM-RB2 | 90.21 | link |
| 26 | internlm/internlm2-20b-reward | 90.16 | link |
| 27 | Skywork/Skywork-VL-Reward-7B | 90.07 | link |
| 28 | facebook/Self-taught-evaluator-llama3.1-70B | 90.01 | link |
| 29 | LxzGordon/URM-LLaMa-3-8B | 89.91 | link |
| 30 | NCSOFT/Llama-3-OffsetBias-RM-8B | 89.42 | link |
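
The Score column is described above as an overall weighted accuracy across the four categories. The exact weights are not stated on this page, so the following is only a minimal sketch of how such a score could be combined from per-category accuracies. The function name `weighted_score`, the equal default weights, and the example numbers are all hypothetical and are not taken from the leaderboard.

```python
from typing import Mapping, Optional

# The four categories named in the leaderboard description.
CATEGORIES = ("Chat", "Chat Hard", "Safety", "Reasoning")


def weighted_score(
    accuracy: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Combine per-category accuracies (in percent) into one weighted score.

    ASSUMPTION: equal weights are used when none are given; the benchmark's
    real weighting may differ.
    """
    if weights is None:
        weights = {category: 1.0 for category in CATEGORIES}

    missing = [c for c in CATEGORIES if c not in accuracy or c not in weights]
    if missing:
        raise ValueError(f"missing categories: {missing}")

    total_weight = sum(weights[c] for c in CATEGORIES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted_sum = sum(accuracy[c] * weights[c] for c in CATEGORIES)
    return weighted_sum / total_weight


# Hypothetical per-category accuracies, for illustration only.
example = {"Chat": 96.0, "Chat Hard": 90.0, "Safety": 92.0, "Reasoning": 98.0}
print(round(weighted_score(example), 2))  # equal weights -> 94.0
```

Passing an explicit `weights` mapping changes the emphasis. For instance, doubling the weight of "Chat Hard" in the example above lowers the combined score, because that category has the lowest accuracy.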