HumanEval+ Leaderboard

HumanEval+ pass@1 (greedy decoding) from the EvalPlus leaderboard. HumanEval+ keeps the original 164 HumanEval problems and augments them with roughly 80x more automatically generated tests, which expose correctness bugs that the original HumanEval tests miss. Open and closed (GPT / Claude / Gemini / o1) models are evaluated under one consistent harness. Pass@1 is the percentage (0–100%) of problems whose single greedy generation passes all tests, including the added plus tests; it is NOT a count of problems solved. · Metric: Pass@1 (higher is better)
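The definition of Pass@1 above can be sketched in a few lines of Python. This is a minimal illustration, not the EvalPlus harness itself: the function names `pass_at_1` and `solved_count`, and the per-problem results in `example`, are hypothetical and exist only to show how a percentage relates to a count over the 164 problems.

```python
from typing import Mapping

HUMANEVAL_PROBLEMS = 164  # number of problems stated in the description above


def pass_at_1(passed: Mapping[str, bool]) -> float:
    """Percentage of problems whose single greedy sample passed every test."""
    if not passed:
        raise ValueError("no results given")
    return 100.0 * sum(passed.values()) / len(passed)


def solved_count(score_percent: float, n_problems: int = HUMANEVAL_PROBLEMS) -> int:
    """Approximate number of problems behind a reported percentage."""
    return round(score_percent / 100.0 * n_problems)


# Hypothetical outcomes: task id -> did the greedy sample pass all tests?
example = {
    "HumanEval/0": True,
    "HumanEval/1": False,
    "HumanEval/2": True,
    "HumanEval/3": True,
}

print(pass_at_1(example))   # 75.0
print(solved_count(89.00))  # 146
```

Read this way, a score of 89.00 corresponds to about 146 of 164 problems, and each additional problem solved moves the score by roughly 0.6 points, so small gaps between neighboring rows amount to one or two problems.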

#	Model	Pass@1 (%)	Paper
1	O1 Mini (Sept 2024)	89.00	link
2	O1 Preview (Sept 2024)	89.00	link
3	GPT 4o (Aug 2024)	87.20	link
4	Qwen2.5-Coder-32B-Instruct	87.20	link
5	DeepSeek-V3 (Nov 2024)	86.60	link
6	GPT-4-Turbo (April 2024)	86.60	link
7	DeepSeek-V2.5 (Nov 2024)	83.50	link
8	GPT 4o Mini (July 2024)	83.50	link
9	DeepSeek-Coder-V2-Instruct	82.30	link
10	Claude Sonnet 3.5 (June 2024)	81.70	link
11	GPT-4-Turbo (Nov 2023)	81.70	link
12	Grok Beta	80.50	link
13	Gemini 1.5 Pro 002	79.30	link
14	GPT-4 (May 2023)	79.30	link
15	CodeQwen1.5-7B-Chat	78.70	link
16	claude-3-opus (Mar 2024)	77.40	link
17	OpenCoder-8B-Instruct	77.40	link
18	Gemini 1.5 Flash 002	75.60	link
19	DeepSeek-Coder-33B-instruct	75.00	link
20	Codestral-22B-v0.1	73.80	link
21	OpenCodeInterpreter-DS-33B	73.80	link
22	WizardCoder-33B-V1.1	73.20	link
23	Artigenz-Coder-DS-6.7B	72.60	link
24	Llama3-70B-instruct	72.00	link
25	Mixtral-8x22B-Instruct-v0.1	72.00	link
26	OpenCodeInterpreter-DS-6.7B	72.00	link
27	speechless-codellama-34B-v2.0	72.00	link
28	DeepSeek-Coder-6.7B-instruct	71.30	link
29	DeepSeek-Coder-7B-instruct-v1.5	71.30	link
30	Magicoder-S-DS-6.7B	71.30	link
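
Several rows in the table share the same Pass@1 but are listed under consecutive rank numbers (for example ranks 1 and 2 at 89.00). A reader who wants tied models to share a rank could recompute ranks as sketched below. This is an illustrative sketch only: `competition_ranks` is a hypothetical helper, and the rows are the first five entries copied from the table.

```python
from itertools import groupby

# (model, Pass@1) rows copied from the table above, already sorted by score.
rows = [
    ("O1 Mini (Sept 2024)", 89.00),
    ("O1 Preview (Sept 2024)", 89.00),
    ("GPT 4o (Aug 2024)", 87.20),
    ("Qwen2.5-Coder-32B-Instruct", 87.20),
    ("DeepSeek-V3 (Nov 2024)", 86.60),
]


def competition_ranks(sorted_rows):
    """Give tied scores the same rank (1, 1, 3, 3, 5, ...).

    Assumes sorted_rows is already ordered by descending score.
    """
    ranked = []
    position = 1
    for _, group in groupby(sorted_rows, key=lambda row: row[1]):
        members = list(group)
        ranked.extend((position, name, score) for name, score in members)
        position += len(members)  # skip the ranks consumed by the tie
    return ranked


for rank, name, score in competition_ranks(rows):
    print(f"{rank}\t{name}\t{score:.2f}")
```

With this scheme the two o1 models share rank 1, the two models at 87.20 share rank 3, and DeepSeek-V3 is rank 5, which matches its position in the original listing.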