MBPP+ Leaderboard

MBPP+ pass@1 scores (greedy decoding) from the EvalPlus leaderboard. MBPP+ is the sanitized MBPP (Mostly Basic Python Problems) set augmented with far more rigorous, automatically generated tests. Open and closed models are evaluated under one consistent harness. Pass@1 is the percentage (0–100) of problems whose single greedy generation passes all (plus) tests; it is NOT a count of problems solved. · Metric: Pass@1 (higher is better)
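The pass@1 definition above can be illustrated with a minimal sketch. It assumes each problem has exactly one greedy generation and that the per-test outcomes for that generation are already available as booleans. The function name `pass_at_1`, the input layout, and the example task IDs are hypothetical, chosen for illustration only; this is not the EvalPlus harness itself.

```python
from typing import Mapping, Sequence


def pass_at_1(results: Mapping[str, Sequence[bool]]) -> float:
    """Return pass@1 as a percentage in the range 0-100.

    `results` maps a problem ID to the outcomes of every test run against
    that problem's single greedy generation (True means the test passed).
    """
    if not results:
        raise ValueError("no problems to score")
    # A problem counts as solved only if it has at least one test and
    # every one of its tests passed.
    solved = sum(
        1 for outcomes in results.values()
        if len(outcomes) > 0 and all(outcomes)
    )
    # Divide by the number of problems, so the result is a rate, not a count.
    return 100.0 * solved / len(results)


# Hypothetical outcomes for four problems (IDs and values are made up).
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],  # one failing test, so not solved
    "task_c": [True, True],
    "task_d": [False],
}
print(f"{pass_at_1(example):.2f}")  # 2 of 4 problems solved -> 50.00
```

Because a single failing test marks the whole problem as unsolved, adding stricter tests can only lower or preserve a model's score relative to the original test set.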

#	Model	Pass@1 (%)	Paper
1	O1 Preview (Sept 2024)	80.20	link
2	O1 Mini (Sept 2024)	78.80	link
3	Qwen2.5-Coder-32B-Instruct	77.00	link
4	DeepSeek-Coder-V2-Instruct	75.10	link
5	Gemini 1.5 Pro 002	74.60	link
6	Claude Sonnet 3.5 (June 2024)	74.30	link
7	DeepSeek-V2.5 (Nov 2024)	74.10	link
8	claude-3-opus (Mar 2024)	73.30	link
9	GPT-4-Turbo (Nov 2023)	73.30	link
10	DeepSeek-V3 (Nov 2024)	73.00	link
11	GPT 4o (Aug 2024)	72.20	link
12	GPT 4o Mini (July 2024)	72.20	link
13	OpenCoder-8B-Instruct	71.40	link
14	DeepSeek-Coder-33B-instruct	70.10	link
15	GPT-3.5-Turbo (Nov 2023)	69.70	link
16	Artigenz-Coder-DS-6.7B	69.60	link
17	claude-3-sonnet (Mar 2024)	69.30	link
18	CodeQwen1.5-7B-Chat	69.00	link
19	Llama3-70B-instruct	69.00	link
20	Magicoder-S-DS-6.7B	69.00	link
21	claude-3-haiku (Mar 2024)	68.80	link
22	OpenCodeInterpreter-DS-33B	68.50	link
23	Gemini 1.5 Flash 002	67.50	link
24	WhiteRabbitNeo-33B-v1	66.90	link
25	OpenCodeInterpreter-DS-6.7B	66.40	link
26	DeepSeek-Coder-6.7B-instruct	65.60	link
27	Grok Beta	65.60	link
28	starcoder2-15b-instruct-v0.1	65.10	link
29	XwinCoder-34B	64.80	link
30	starchat2-15b-v0.1	64.60	link