SWE-bench

Canonical

84 papers using it

66,082 HF downloads

144 HF likes

2023 first seen

Real GitHub issues paired with their fix PRs across Python repositories; models must produce patches that pass the repo's tests.
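
The pass criterion described above can be sketched roughly as follows. This is a minimal illustration, not the official evaluation harness: the dataset exposes `FAIL_TO_PASS` and `PASS_TO_PASS` test lists per instance, while the function name `is_resolved`, the status strings, and the test IDs below are hypothetical and chosen only for the example.

```python
from typing import Dict, Iterable


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    test_status: Dict[str, str],
) -> bool:
    """Return True if a patched repo counts as resolving the issue.

    Assumption: test_status maps a test ID to "PASSED" or "FAILED"
    after the model's patch has been applied and the tests were run.
    """
    # Tests that the reference fix is expected to turn green.
    required = list(fail_to_pass)
    # Tests that passed before the fix and must not regress.
    required += list(pass_to_pass)
    # A test missing from the results is treated as not passing.
    return all(test_status.get(test) == "PASSED" for test in required)


# Hypothetical instance: one test targets the bug, one guards against regressions.
status_after_patch = {
    "tests/test_bug.py::test_issue": "PASSED",
    "tests/test_core.py::test_existing": "PASSED",
}
resolved = is_resolved(
    ["tests/test_bug.py::test_issue"],
    ["tests/test_core.py::test_existing"],
    status_after_patch,
)
```

Under this sketch, a patch that fixes the target test but breaks an existing one is not counted as resolved.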

🤗 Hugging Face

Papers using SWE-bench (84)

Evaluating Agent-based Program Repair at Google · 2025 · 4 cites

SWE-Exp: Experience-Driven Software Issue Resolution · 2025 · 3 cites

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer · 2026

SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution · 2025 · 2 cites

Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling · 2025 · 2 cites

Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories · 2025 · 1 cite

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench · 2025 · 1 cite

Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults · 2025 · 1 cite

SWE-bench Goes Live! · 2025 · 1 cite

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark · 2024 · 1 cite

HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale · 2024 · 1 cite

SWE-Explore: Benchmarking How Coding Agents Explore Repositories · 2026

To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair · 2026

An Execution-Verified Multi-Language Benchmark for Code Semantic Reasoning · 2026

From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents · 2026

CODESTRUCT: Code Agents over Structured Action Spaces · 2026

SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents · 2026

From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python · 2026

Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis · 2026

SWE-QA: A Dataset and Benchmark for Complex Code Understanding · 2026

Qwen3-Coder-Next Technical Report · 2026

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners · 2026

SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback · 2026

Compressing Code Context for LLM-based Issue Resolution · 2026

OmniCode: A Benchmark for Evaluating Software Engineering Agents · 2026

SVRepair: Structured Visual Reasoning for Automated Program Repair · 2026

Pull Requests as a Training Signal for Repo-Level Code Editing · 2026

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development · 2026

Hybrid-Gym: Training Coding Agents to Generalize Across Tasks · 2026

Narrowing the Complexity Gap in the Evaluation of Large Language Models · 2026

SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents · 2026

MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences · 2026

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories · 2025

SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks · 2025

SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent · 2025

Multi-Agent Code Verification via Information Theory · 2025

Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation · 2025

More with Less: An Empirical Study of Turn-Control Strategies for Efficient Coding Agents · 2025

When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents? · 2025

Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization · 2025

Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents · 2025

A Benchmark for Localizing Code and Non-Code Issues in Software Projects · 2025

BloomAPR: A Bloom's Taxonomy-based Framework for Assessing the Capabilities of LLM-Powered APR Solutions · 2025

Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs · 2025

The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management · 2025

Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning · 2025

How Safe Are AI-Generated Patches? A Large-scale Study on Security Risks in LLM and Agentic Automated Program Repair on SWE-bench · 2025

SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks · 2025

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering · 2025

MCTS-Refined CoT: High-Quality Fine-Tuning Data for LLM-Based Repository Issue Resolution · 2025

Unified Software Engineering Agent as AI Software Engineer · 2025

SemAgent: A Semantics Aware Program Repair Agent · 2025

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? · 2025

Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks · 2025

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents · 2025

Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering · 2025

SWE-smith: Scaling Data for Software Engineering Agents · 2025

Automated Benchmark Generation for Repository-Level Coding Tasks · 2025

Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios · 2025

PatchPilot: A Cost-Efficient Software Engineering Agent with Early Attempts on Formal Verification · 2025

CodeMonkeys: Scaling Test-Time Compute for Software Engineering · 2025

Large Language Model Critics for Execution-Free Evaluation of Code Changes · 2025

RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph · 2024 · 1 cite

Towards Exception Safety Code Generation with Intermediate Representation Agents Framework · 2024

BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning · 2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? · 2023 · 48 cites

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering · 2024 · 29 cites

MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution · 2024 · 9 cites

SWE-Bench+: Enhanced Coding Benchmark for LLMs · 2024 · 5 cites

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java · 2024 · 3 cites

MarsCode Agent: AI-native Automated Bug Fixing · 2024 · 3 cites

SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? · 2024 · 2 cites

Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks · 2024 · 1 cite

CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases · 2024 · 1 cite

Exploring the Potential of Conversational Test Suite Based Program Repair on SWE-bench · 2024 · 1 cite

CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models · 2024 · 1 cite

Can Github issues be solved with Tree Of Thoughts? · 2024

SpecRover: Code Intent Extraction via LLMs · 2024

REDO: Execution-Free Runtime Error Detection for COding Agents · 2024

A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models · 2024
