BigCodeBench (Elo) Leaderboard (bigcodebench-elo)
Elo-style pairwise rating across BigCodeBench tasks. Complementary to Pass@1: it captures relative strength when multiple models attempt the same prompts. Larger Elo differences tend to correspond to larger Pass@1 gaps. · Metric: Elo (higher is better)
| # | Model | Elo | Paper |
|---|---|---|---|
| 1 | DeepSeek-V3-Chat | 1216.89 | - |
| 2 | GPT-4o-2024-05-13 | 1216.72 | - |
| 3 | DeepSeek-V2-Chat (2024-06-28) | 1186.31 | - |
| 4 | DeepSeek-Coder-V2-Instruct | 1184.20 | - |
| 5 | Gemini-Exp-1114 | 1173.74 | - |
| 6 | Gemini-Exp-1206 | 1172.42 | - |
| 7 | Qwen2.5-Coder-32B-Instruct | 1168.91 | - |
| 8 | GPT-4-Turbo-2024-04-09 | 1162.95 | - |
| 9 | GPT-4o-2024-11-20 | 1156.35 | - |
| 10 | Claude-3.5-Sonnet-20240620 | 1146.48 | - |
| 11 | GPT-4-0613 | 1143.07 | - |
| 12 | Codestral-2501 | 1142.93 | - |
| 13 | Claude-3.5-Haiku-20241022 | 1142.85 | - |
| 14 | Gemini-2.0-Flash-Exp | 1142.47 | - |
| 15 | Llama-3.3-70B-Instruct | 1142.14 | - |
| 16 | GPT-4o-mini-2024-07-18 | 1141.20 | - |
| 17 | Athene-V2-Chat | 1140.81 | - |
| 18 | Claude-3-Opus-20240229 | 1132.72 | - |
| 19 | Athene-V2-Agent | 1128.42 | - |
| 20 | Hermes-2-Theta-Llama-3-70B | 1127.49 | - |
| 21 | Qwen2.5-72B-Instruct | 1125.66 | - |
| 22 | Gemini-Exp-1121 | 1123.33 | - |
| 23 | Gemini-1.5-Pro-API-0514 | 1123.08 | - |
| 24 | DeepSeek-V2.5-1210 | 1123.05 | - |
| 25 | Llama-3.1-70B-Instruct | 1122.56 | - |
| 26 | Phi-4 | 1119.78 | - |
| 27 | Claude-3.5-Sonnet-20241022 | 1112.66 | - |
| 28 | Gemini-1.5-Flash-API-0514 | 1105.38 | - |
| 29 | Llama-3-70B-Instruct | 1099.57 | - |
| 30 | Llama-3-70B-Synthia-v3.5 | 1096.57 | - |
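
To make the gaps between rows easier to read, here is a minimal sketch of the textbook logistic Elo model. It is not necessarily the procedure this leaderboard uses: the 400-point scale, the K-factor of 32, and the function names `expected_score` and `elo_update` are all assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win probability of A over B under the standard logistic Elo model.

    The 400-point scale is the classic chess convention, assumed here;
    the leaderboard may fit its ratings differently.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A's solution wins, 0.5 for a tie, 0.0 if it loses.
    The K-factor of 32 is an illustrative assumption.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Ratings taken from rows 1 and 30 of the table above.
    top, bottom = 1216.89, 1096.57
    print(f"Expected win rate of row 1 over row 30: {expected_score(top, bottom):.3f}")
```

Under these assumptions, the roughly 120-point gap between the first and last rows corresponds to an expected win rate of about two in three for the higher-rated model, while neighbouring rows separated by less than a point are effectively indistinguishable.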