HumanEval Leaderboard
Auto-discovered from papers reporting HumanEval (pass@1). · Metric: pass@1 (higher is better)
| # | Paper (method) | pass@1 (%) |
|---|---|---|
| 1 | Planning-Driven Programming: A Large Language Model Programming Workflow | 98.20 |
| 2 | SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution | 95.70 |
| 3 | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | 95.10 |
| 4 | CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models | 95.10 |
| 5 | Poison with Style: A Practical Poisoning Attack on Code Large Language Models | 95.00 |
| 6 | Multi-task Code LLMs: Data Mix or Model Merge? | 92.70 |
| 7 | ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement | 87.20 |
| 8 | BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation | 83.50 |
| 9 | CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation | 77.80 |
| 10 | Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation | 70.12 |
| 11 | CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing | 63.00 |
| 12 | Modularization is Better: Effective Code Generation with Modular Prompting | 58.10 |
| 13 | Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach | 35.71 |
| 14 | Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality | 31.22 |
| 15 | Context-Augmented Code Generation Using Programming Knowledge Graphs | 20.00 |
| 16 | Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding | 17.10 |
| 17 | RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing | 5.50 |
| 18 | A Mixture of Linear Corrections Generates Secure Code | 2.10 |
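
For reference, pass@1 scores like those above are often computed with the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below is illustrative only: the function names and the `(n, c)` input layout are assumptions, and individual papers in the table may use a different protocol (for example, a single greedy sample per problem).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems, as a percentage.

    results: hypothetical per-problem (n, c) pairs.
    """
    if not results:
        raise ValueError("results must not be empty")
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)
```

With k = 1 the estimator reduces to c / n per problem, so the benchmark score is the mean fraction of passing samples, expressed in percent.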