HumanEval

Canonical

242 papers using it

2021 first seen

164 hand-written Python programming problems with unit tests, evaluating functional code generation from docstrings (pass@k).
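
The pass@k metric mentioned above is usually computed with the unbiased estimator from the original HumanEval paper (Evaluating Large Language Models Trained on Code): for each problem, draw n samples, count the c samples that pass the unit tests, and estimate 1 - C(n-c, k) / C(n, k). The sketch below illustrates that formula only; the function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` input layout are assumptions made for this example, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed all unit tests,
    k: sample budget being scored. (Hypothetical helper for illustration.)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with 10 samples per problem and 3 passing, `pass_at_k(10, 3, 1)` is the plain pass rate of 0.3, while larger k values reward having at least one passing sample among several attempts.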

Papers using HumanEval (200)

A Survey on Large Language Models for Code Generation · 2024 · 59 cites

CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging · 2025 · 7 cites

KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding · 2025 · 6 cites

Fully Autonomous Programming using Iterative Multi-Agent Debugging with Large Language Models · 2025 · 6 cites

KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding · 2025 · 5 cites

Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic? · 2022 · 6 cites

Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding · 2025 · 3 cites

Regression Accumulation in Multi-Turn LLM Programming Conversations · 2026

Type-Constrained Code Generation with Language Models · 2025 · 3 cites

Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs? · 2025 · 3 cites

Poison with Style: A Practical Poisoning Attack on Code Large Language Models · 2026

Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points · 2025 · 3 cites

Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach · 2025 · 2 cites

Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality · 2025 · 2 cites

A Taxonomy of Inefficiencies in LLM-Generated Python Code · 2025 · 2 cites

Prompt Alchemy: Automatic Prompt Refinement for Enhancing Code Generation · 2025 · 2 cites

Large Language Model Guided Self-Debugging Code Generation · 2025 · 2 cites

On the Effectiveness of Large Language Models in Domain-Specific Code Generation · 2023 · 3 cites

QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks · 2025 · 2 cites

Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities · 2025 · 2 cites

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging · 2024 · 4 cites

mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation · 2024 · 2 cites

From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence · 2025 · 1 cite

Automated Repair of Ambiguous Problem Descriptions for LLM-Based Code Generation · 2025 · 1 cite

OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs · 2025 · 1 cite

ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement · 2025 · 1 cite

Enhancing LLM Code Generation with Ensembles: A Similarity-Based Selection Approach · 2025 · 1 cite

CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging · 2025 · 1 cite

Dafny as Verification-Aware Intermediate Language for Code Generation · 2025 · 1 cite

Fixing Function-Level Code Generation Errors for Foundation Large Language Models · 2024 · 1 cite

CodeMirage: Hallucinations in Code Generated by Large Language Models · 2024 · 1 cite

Where Is Self-admitted Code Generated by Large Language Models on GitHub? · 2024 · 1 cite

Improved Large Language Diffusion Models · 2026

Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation · 2026

An Execution-Verified Multi-Language Benchmark for Code Semantic Reasoning · 2026

Prompt Optimization for LLM Code Generation via Reinforcement Learning · 2026

Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation · 2026

SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution · 2026

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices · 2026

MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning · 2026

ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning · 2026

CangjieBench: Benchmarking LLMs on a Low-Resource General-Purpose Programming Language · 2026

LLMLOOP: Improving LLM-Generated Code and Tests through Automated Iterative Feedback Loops · 2026

Think Anywhere in Code Generation · 2026

TextBFGS: A Case-Based Reasoning Approach to Code Optimization via Error-Operator Retrieval · 2026

IntentCoding: Amplifying User Intent in Code Generation · 2026

OmniCode: A Benchmark for Evaluating Software Engineering Agents · 2026

BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation · 2026

Towards Green AI: Decoding the Energy of LLM Inference in Software Development · 2026

Automated Test Suite Enhancement Using Large Language Models with Few-shot Prompting · 2026

Bootstrapping Code Translation with Weighted Multilanguage Exploration · 2026

Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study · 2026

ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation · 2026

Benchmarking Large Language Models for ABAP Code Generation: An Empirical Study on Iterative Improvement by Compiler Feedback · 2026

NOIR: Privacy-Preserving Generation of Code with Open-Source LLMs · 2026

Multi-task Code LLMs: Data Mix or Model Merge? · 2026

Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation · 2026

Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference · 2026

Generating Verifiable Chain of Thoughts from Exection-Traces · 2025

CODE ACROSTIC: Robust Watermarking for Code Generation · 2025

An Exploratory Study of Bayesian Prompt Optimization for Test-Driven Code Generation with Large Language Models · 2025

PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents · 2025

How Natural Language Proficiency Shapes GenAI Code for Software Engineering Tasks · 2025

Effective Code Membership Inference for Code Completion Models via Adversarial Prompts · 2025

Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths · 2025

ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation · 2025

Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models · 2025

AdapTrack: Constrained Decoding without Distorting LLM's Output Intent · 2025

Understanding the Characteristics of LLM-Generated Property-Based Tests in Exploring Edge Cases · 2025

TALM: Dynamic Tree-Structured Multi-Agent Framework with Long-Term Memory for Scalable Code Generation · 2025

SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin · 2025

Evaluating SAP Joule for Code Generation · 2025

A Multi-Language Object-Oriented Programming Benchmark for Large Language Models · 2025

CodeGrad: Integrating Multi-Step Verification with Gradient-Based LLM Refinement · 2025

Static Analysis as a Feedback Loop: Enhancing LLM-Generated Code Beyond Correctness · 2025

Alignment with Fill-In-the-Middle for Enhancing Code Generation · 2025

STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning · 2025

Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks · 2025

Rethinking Verification for LLM Code Generation: From Generation to Testing · 2025

A Mixture of Linear Corrections Generates Secure Code · 2025

Turning the Tide: Repository-based Code Reflection · 2025

CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing · 2025

MemoCoder: Automated Function Synthesis using LLM-Supported Agents · 2025

Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection · 2025

When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions · 2025

AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation · 2025

Prompt engineering and framework: implementation to increase code reliability based guideline for LLMs · 2025

Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees · 2025

Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol · 2025

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence · 2025

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code · 2025

CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts · 2025

Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks · 2025

Rethinking Repetition Problems of LLMs in Code Generation · 2025

Evaluating Large Language Models for Code Review · 2025

Self-Correcting Code Generation Using Small Language Models · 2025

FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks · 2025

ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions · 2025

Modularization is Better: Effective Code Generation with Modular Prompting · 2025

Can LLMs Enable Verification in Mainstream Programming? · 2025

ACECODER: Acing Coder RL via Automated Test-Case Synthesis · 2025

Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment · 2025

UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance · 2025

Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities · 2025

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models · 2025

Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement · 2025

Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval · 2025

Pragmatic Reasoning improves LLM Code Generation · 2025

Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks · 2025

CoCoNUT: Structural Code Understanding does not fall out of a tree · 2025

CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation · 2025

Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs · 2025

Planning-Driven Programming: A Large Language Model Programming Workflow · 2024

Context-Augmented Code Generation Using Programming Knowledge Graphs · 2024

Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' · 2024

CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution · 2024

EffiLearner: Enhancing Efficiency of Generated Code via Self-Optimization · 2024 · 1 cite

ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation · 2024

LLM4Decompile: Decompiling Binary Code with Large Language Models · 2024 · 28 cites

Evaluating Large Language Models Trained on Code · 2021 · 1,451 cites

Code Llama: Open Foundation Models for Code · 2023 · 414 cites

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis · 2022 · 242 cites

CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X · 2023 · 220 cites

StarCoder: may the source be with you! · 2023 · 197 cites

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation · 2023 · 176 cites

WizardCoder: Empowering Code Large Language Models with Evol-Instruct · 2023 · 84 cites

Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT · 2023 · 70 cites

CodeT: Code Generation with Generated Tests · 2022 · 65 cites

CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges · 2024 · 56 cites

Self-Edit: Fault-Aware Code Editor for Code Generation · 2023 · 44 cites

The Stack: 3 TB of permissively licensed source code · 2022 · 40 cites

Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step · 2024 · 40 cites

CYCLE: Learning to Self-Refine the Code Generation · 2024 · 29 cites

Magicoder: Empowering Code Generation with OSS-Instruct · 2023 · 28 cites

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code · 2024 · 28 cites

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation · 2023 · 25 cites

Interactive Code Generation via Test-Driven User-Intent Formalization · 2022 · 24 cites

MapCoder: Multi-Agent Code Generation for Competitive Problem Solving · 2024 · 23 cites

On Evaluating the Efficiency of Source Code Generated by LLMs · 2024 · 21 cites

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation · 2023 · 20 cites

SelfEvolve: A Code Evolution Framework via Large Language Models · 2023 · 18 cites

Structured Chain-of-Thought Prompting for Code Generation · 2023 · 17 cites

OctoPack: Instruction Tuning Code Large Language Models · 2023 · 17 cites

Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions · 2022 · 14 cites

Fault-Aware Neural Code Rankers · 2022 · 13 cites

MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation · 2022 · 13 cites

CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology · 2024 · 12 cites

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion · 2023 · 11 cites

Using Large Language Models to Generate JUnit Tests: An Empirical Study · 2023 · 10 cites

SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents · 2024 · 10 cites

PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs · 2024 · 9 cites

JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models · 2024 · 9 cites

OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models · 2024 · 7 cites

Investigating the Performance of Language Models for Completing Code in Functional Programming Languages: a Haskell Case Study · 2024 · 7 cites

ReCode: Robustness Evaluation of Code Generation Models · 2022 · 6 cites

CodeCoT: Tackling Code Syntax Errors in CoT Reasoning for Code Generation · 2023 · 6 cites

NExT: Teaching Large Language Models to Reason about Code Execution · 2024 · 6 cites

SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning · 2024 · 6 cites

Is Self-Repair a Silver Bullet for Code Generation? · 2023 · 5 cites

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution · 2024 · 5 cites

Predicting Code Coverage without Execution · 2023 · 4 cites

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement · 2024 · 4 cites

Software Vulnerability and Functionality Assessment using LLMs · 2024 · 4 cites

Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM · 2024 · 4 cites

Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization · 2024 · 4 cites

Reasoning Runtime Behavior of a Program with LLM: How Far Are We? · 2024 · 3 cites

Large Language Model Evaluation Via Multi AI Agents: Preliminary results · 2024 · 3 cites

LeTI: Learning to Generate from Textual Interactions · 2023 · 2 cites

Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency · 2023 · 2 cites

CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules · 2023 · 2 cites

NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness · 2024 · 2 cites

HumanEval on Latest GPT Models -- 2024 · 2024 · 2 cites

CodeShell Technical Report · 2024 · 2 cites

Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation · 2024 · 2 cites

XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts · 2024 · 2 cites

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts · 2024 · 2 cites

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation · 2024 · 2 cites

How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark · 2024 · 2 cites

Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models · 2024 · 2 cites

SelfCodeAlign: Self-Alignment for Code Generation · 2024 · 2 cites

A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks · 2024 · 2 cites

The Program Testing Ability of Large Language Models for Code · 2023 · 1 cite

The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers · 2024 · 1 cite

AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct · 2024 · 1 cite

Qiskit Code Assistant: Training LLMs for generating Quantum Computing Code · 2024 · 1 cite

Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation · 2024 · 1 cite

Towards Large Language Model Aided Program Refinement · 2024 · 1 cite

DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation · 2024 · 1 cite

Planning In Natural Language Improves LLM Search For Code Generation · 2024 · 1 cite

How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data · 2024 · 1 cite

Multi-Programming Language Ensemble for Code Generation in Large Language Model · 2024 · 1 cite

Selection of Prompt Engineering Techniques for Code Generation through Predicting Code Complexity · 2024 · 1 cite

Training Language Models on Synthetic Edit Sequences Improves Code Synthesis · 2024 · 1 cite

Self-Evolving Multi-Agent Collaboration Networks for Software Development · 2024 · 1 cite

CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models · 2024 · 1 cite

Does Few-Shot Learning Help LLM Performance in Code Synthesis? · 2024 · 1 cite

PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback · 2024 · 1 cite