HumanEval+ Leaderboard
HumanEval+ pass@1 (greedy decoding) from the EvalPlus leaderboard: the original 164 HumanEval problems, augmented with roughly 80x more automatically generated tests that expose correctness bugs the original HumanEval tests miss. Covers both open and closed (GPT / Claude / Gemini / o1) models under one consistent harness. Pass@1 is the percentage (0 to 100) of problems whose single greedy generation passes all tests, including the added plus tests; it is NOT a count of problems solved. · Metric: Pass@1 (higher is better)
| # | Model | Pass@1 (%) | Paper |
|---|---|---|---|
| 1 | O1 Mini (Sept 2024) | 89.00 | link |
| 2 | O1 Preview (Sept 2024) | 89.00 | link |
| 3 | GPT 4o (Aug 2024) | 87.20 | link |
| 4 | Qwen2.5-Coder-32B-Instruct | 87.20 | link |
| 5 | DeepSeek-V3 (Nov 2024) | 86.60 | link |
| 6 | GPT-4-Turbo (April 2024) | 86.60 | link |
| 7 | DeepSeek-V2.5 (Nov 2024) | 83.50 | link |
| 8 | GPT 4o Mini (July 2024) | 83.50 | link |
| 9 | DeepSeek-Coder-V2-Instruct | 82.30 | link |
| 10 | Claude Sonnet 3.5 (June 2024) | 81.70 | link |
| 11 | GPT-4-Turbo (Nov 2023) | 81.70 | link |
| 12 | Grok Beta | 80.50 | link |
| 13 | Gemini 1.5 Pro 002 | 79.30 | link |
| 14 | GPT-4 (May 2023) | 79.30 | link |
| 15 | CodeQwen1.5-7B-Chat | 78.70 | link |
| 16 | claude-3-opus (Mar 2024) | 77.40 | link |
| 17 | OpenCoder-8B-Instruct | 77.40 | link |
| 18 | Gemini 1.5 Flash 002 | 75.60 | link |
| 19 | DeepSeek-Coder-33B-instruct | 75.00 | link |
| 20 | Codestral-22B-v0.1 | 73.80 | link |
| 21 | OpenCodeInterpreter-DS-33B | 73.80 | link |
| 22 | WizardCoder-33B-V1.1 | 73.20 | link |
| 23 | Artigenz-Coder-DS-6.7B | 72.60 | link |
| 24 | Llama3-70B-instruct | 72.00 | link |
| 25 | Mixtral-8x22B-Instruct-v0.1 | 72.00 | link |
| 26 | OpenCodeInterpreter-DS-6.7B | 72.00 | link |
| 27 | speechless-codellama-34B-v2.0 | 72.00 | link |
| 28 | DeepSeek-Coder-6.7B-instruct | 71.30 | link |
| 29 | DeepSeek-Coder-7B-instruct-v1.5 | 71.30 | link |
| 30 | Magicoder-S-DS-6.7B | 71.30 | link |
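
The Pass@1 definition in the description can be made concrete with a small sketch. It assumes per-problem outcomes are available as booleans (True when the single greedy generation passed every test). The function name `pass_at_1` and the count of 146 solved problems are hypothetical, chosen only for illustration; they are not taken from the EvalPlus harness or from any row of the table.

```python
NUM_PROBLEMS = 164  # number of HumanEval problems, per the description


def pass_at_1(results: list[bool]) -> float:
    """Return the percentage of problems whose one greedy sample passed all tests."""
    if not results:
        raise ValueError("results must contain at least one problem outcome")
    return 100.0 * sum(results) / len(results)


# Hypothetical outcome: 146 of the 164 problems pass all tests.
outcomes = [True] * 146 + [False] * (NUM_PROBLEMS - 146)
score = pass_at_1(outcomes)

# The score is a percentage, not a count of solved problems.
print(f"Pass@1 = {score:.1f}%")
```

Because every score is a multiple of 1/164, one additional solved problem moves Pass@1 by about 0.6 percentage points. Gaps smaller than that between two rows cannot occur, and identical scores mean identical solved counts rather than identical sets of solved problems.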