BigCodeBench (Instruct) Leaderboard (bigcodebench-instruct)
BigCodeBench Instruct subset: the same 1,140 tasks, but evaluated in an instruction-following format. It tests whether instruction-tuned models can follow prompts to produce correct code. Metric: Pass@1 (higher is better).
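As a rough illustration of the metric, the sketch below computes Pass@1 from per-task sample counts using the standard unbiased pass@k estimator. The function names and the `(n_samples, n_correct)` input format are hypothetical and are not taken from the BigCodeBench evaluation harness. When each task has a single generated solution, Pass@1 reduces to the percentage of tasks whose solution passes all tests.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: samples that pass all tests,
    k: number of attempts allowed.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean pass@1 over tasks, as a percentage.

    results: one (n_samples, n_correct) pair per task (hypothetical format).
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

For example, `benchmark_pass_at_1([(1, 1), (1, 0)])` scores one solved task out of two, which corresponds to a leaderboard value of 50.00.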
| # | Model | Pass@1 | Paper |
|---|---|---|---|
| 1 | GPT-4o-2024-05-13 | 51.10 | - |
| 2 | DeepSeek-V3 | 50.00 | - |
| 3 | Llama-4-Maverick | 49.70 | - |
| 4 | Quasar-Alpha | 49.60 | - |
| 5 | Gemini-Exp-1114 | 49.20 | - |
| 6 | Qwen2.5-Coder-32B-Instruct | 49.00 | - |
| 7 | DeepSeek-V2-Chat (2024-06-28) | 48.90 | - |
| 8 | GPT-4.1-Mini-2025-04-14 | 48.90 | - |
| 9 | DeepSeek-V2.5-1210 | 48.60 | - |
| 10 | DeepSeek-Coder-V2-Instruct | 48.20 | - |
| 11 | GPT-4-Turbo-2024-04-09 | 48.20 | - |
| 12 | Qwen2.5-Coder-14B-Instruct | 48.20 | - |
| 13 | GPT-4o-2024-11-20 | 48.00 | - |
| 14 | Athene-V2-Chat | 47.20 | - |
| 15 | Gemini-Exp-1206 | 47.00 | - |
| 16 | Llama-3.3-70B-Instruct | 46.90 | - |
| 17 | Claude-3.5-Sonnet-20240620 | 46.80 | - |
| 18 | Athene-V2-Agent | 46.20 | - |
| 19 | Claude-3.5-Haiku-20241022 | 46.10 | - |
| 20 | GPT-4o-mini-2024-07-18 | 46.10 | - |
| 21 | Llama-3.1-70B-Instruct | 46.10 | - |
| 22 | GPT-4-0613 | 46.00 | - |
| 23 | Gemini-2.0-Flash-Exp | 45.90 | - |
| 24 | Qwen2.5-72B-Instruct | 45.80 | - |
| 25 | Hermes-2-Theta-Llama-3-70B | 45.60 | - |
| 26 | Claude-3-Opus-20240229 | 45.50 | - |
| 27 | Phi-4 | 45.50 | - |
| 28 | Gemini-Exp-1121 | 45.40 | - |
| 29 | Mistral-Small-24B-Instruct-2501 | 45.30 | - |
| 30 | Sky-T1-32B-Flash | 45.10 | - |