MBPP+ Leaderboard
MBPP+ pass@1 (greedy decoding) from the EvalPlus leaderboard: the MBPP (Mostly Basic Python Problems) sanitized set, augmented with far more rigorous automatically generated tests. Covers both open and closed models under one consistent harness. Pass@1 is the fraction (0–100%) of problems whose single greedy generation passes all (plus) tests; it is NOT a count of problems solved. · Metric: Pass@1 (higher is better)
| # | Model | Pass@1 | Paper |
|---|---|---|---|
| 1 | O1 Preview (Sept 2024) | 80.20 | link |
| 2 | O1 Mini (Sept 2024) | 78.80 | link |
| 3 | Qwen2.5-Coder-32B-Instruct | 77.00 | link |
| 4 | DeepSeek-Coder-V2-Instruct | 75.10 | link |
| 5 | Gemini 1.5 Pro 002 | 74.60 | link |
| 6 | Claude Sonnet 3.5 (June 2024) | 74.30 | link |
| 7 | DeepSeek-V2.5 (Nov 2024) | 74.10 | link |
| 8 | claude-3-opus (Mar 2024) | 73.30 | link |
| 9 | GPT-4-Turbo (Nov 2023) | 73.30 | link |
| 10 | DeepSeek-V3 (Nov 2024) | 73.00 | link |
| 11 | GPT 4o (Aug 2024) | 72.20 | link |
| 12 | GPT 4o Mini (July 2024) | 72.20 | link |
| 13 | OpenCoder-8B-Instruct | 71.40 | link |
| 14 | DeepSeek-Coder-33B-instruct | 70.10 | link |
| 15 | GPT-3.5-Turbo (Nov 2023) | 69.70 | link |
| 16 | Artigenz-Coder-DS-6.7B | 69.60 | link |
| 17 | claude-3-sonnet (Mar 2024) | 69.30 | link |
| 18 | CodeQwen1.5-7B-Chat | 69.00 | link |
| 19 | Llama3-70B-instruct | 69.00 | link |
| 20 | Magicoder-S-DS-6.7B | 69.00 | link |
| 21 | claude-3-haiku (Mar 2024) | 68.80 | link |
| 22 | OpenCodeInterpreter-DS-33B | 68.50 | link |
| 23 | Gemini 1.5 Flash 002 | 67.50 | link |
| 24 | WhiteRabbitNeo-33B-v1 | 66.90 | link |
| 25 | OpenCodeInterpreter-DS-6.7B | 66.40 | link |
| 26 | DeepSeek-Coder-6.7B-instruct | 65.60 | link |
| 27 | Grok Beta | 65.60 | link |
| 28 | starcoder2-15b-instruct-v0.1 | 65.10 | link |
| 29 | XwinCoder-34B | 64.80 | link |
| 30 | starchat2-15b-v0.1 | 64.60 | link |
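
To make the Pass@1 column concrete, here is a minimal sketch of how a percentage like those in the table could be computed from per-problem test outcomes. It is not the EvalPlus harness: the function name `pass_at_1`, the input layout, and the toy data are all hypothetical and chosen only for illustration. It assumes one greedy sample per problem and a list of pass/fail results for that sample's tests.

```python
from typing import Mapping, Sequence


def pass_at_1(per_problem_results: Mapping[str, Sequence[bool]]) -> float:
    """Return pass@1 as a percentage in [0, 100].

    Hypothetical input layout: each key is a problem id, each value holds the
    pass/fail outcome of every test run against that problem's single greedy
    generation. A problem counts as solved only if all of its tests pass.
    """
    if not per_problem_results:
        raise ValueError("no problems to score")
    # An empty test list is treated as unsolved (an assumption of this sketch),
    # because all([]) would otherwise count it as a pass.
    solved = sum(
        1 for outcomes in per_problem_results.values()
        if len(outcomes) > 0 and all(outcomes)
    )
    return 100.0 * solved / len(per_problem_results)


# Toy data, invented for illustration (not real benchmark results).
toy_results = {
    "task_a": [True, True, True],   # solved: every test passes
    "task_b": [True, False, True],  # not solved: one test fails
    "task_c": [True, True],         # solved
    "task_d": [False],              # not solved
}

score = pass_at_1(toy_results)
print(f"{score:.2f}")  # 2 of 4 problems solved -> 50.00
```

The result is a percentage, not a count: the toy run scores 50.00 because 2 of 4 problems are solved. Passing most of a problem's tests earns no partial credit, which is why adding stricter tests can only lower a model's score or leave it unchanged.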