HumanEval humaneval-2 Leaderboard
Auto-discovered from papers reporting HumanEval (accuracy) · Metric: accuracy (higher is better)
| # | Model | accuracy | Paper |
|---|---|---|---|
| 1 | Multi-Programming Language Ensemble for Code Generation in Large Language Model | 96.25 | - |
| 2 | Enhancing LLM Code Generation with Ensembles: A Similarity-Based Selection Approach | 90.20 | - |
| 3 | CoCoNUT: Structural Code Understanding does not fall out of a tree | 47.00 | - |
| 4 | Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks | 23.79 | - |
| 5 | From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | 18.90 | - |
| 6 | Fixing Function-Level Code Generation Errors for Foundation Large Language Models | 7.50 | - |
| 7 | FALCON: Feedback-driven Adaptive Long/short-term memory reinforced Coding Optimization system | 6.10 | - |
| 8 | Prompt Alchemy: Automatic Prompt Refinement for Enhancing Code Generation | 5.00 | - |
| 9 | Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection | 2.40 | - |
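
The ordering rule used in the table (accuracy, higher is better) can be sketched as follows. This is a minimal illustration and not the leaderboard's actual ranking code: `Entry` and `rank` are hypothetical names, and the three sample rows are copied from the table.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical structure, for illustration only)."""
    title: str
    accuracy: float


def rank(entries: List[Entry]) -> List[Tuple[int, Entry]]:
    """Return (position, entry) pairs sorted by accuracy, highest first."""
    ordered = sorted(entries, key=lambda e: e.accuracy, reverse=True)
    return list(enumerate(ordered, start=1))


# Sample rows taken from the table, deliberately listed out of order.
entries = [
    Entry("FALCON: Feedback-driven Adaptive Long/short-term memory "
          "reinforced Coding Optimization system", 6.10),
    Entry("Multi-Programming Language Ensemble for Code Generation "
          "in Large Language Model", 96.25),
    Entry("CoCoNUT: Structural Code Understanding does not fall out "
          "of a tree", 47.00),
]

for position, entry in rank(entries):
    print(f"| {position} | {entry.title} | {entry.accuracy:.2f} |")
```

Python's `sorted` is stable, so entries with equal accuracy would keep their input order; how the leaderboard itself breaks ties is not stated.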