HumanEval Leaderboard

#	Model	pass@1 (%)	Paper
1	Planning-Driven Programming: A Large Language Model Programming Workflow	98.20	—
2	SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution	95.70	—
3	CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging	95.10	—
4	CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models	95.10	—
5	Poison with Style: A Practical Poisoning Attack on Code Large Language Models	95.00	—
6	Multi-task Code LLMs: Data Mix or Model Merge?	92.70	—
7	ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement	87.20	—
8	BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation	83.50	—
9	CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation	77.80	—
10	Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation	70.12	—
11	CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing	63.00	—
12	Modularization is Better: Effective Code Generation with Modular Prompting	58.10	—
13	Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach	35.71	—
14	Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality	31.22	—
15	Context-Augmented Code Generation Using Programming Knowledge Graphs	20.00	—
16	Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding	17.10	—
17	RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing	5.50	—
18	A Mixture of Linear Corrections Generates Secure Code	2.10	—
