SWE-bench
Canonical · 91 papers using it
First seen: 2023
SWE-bench is a benchmark that tests whether language models and coding agents can resolve real-world GitHub issues. It has also been used to evaluate agent development kits (ADKs) by assessing the effectiveness of the agents they produce, under a controlled methodology that involves an LLM coding agent.
Papers using SWE-bench (91)
- ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer
- Evaluating Agent-based Program Repair at Google
- UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
- SWE-bench Goes Live!
- Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs
- TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark
- HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale
- SWE-Explore: Benchmarking How Coding Agents Explore Repositories
- From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
- CODESTRUCT: Code Agents over Structured Action Spaces
- SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents
- From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python
- Qwen3-Coder-Next Technical Report
- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
- SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback
- Compressing Code Context for LLM-based Issue Resolution
- OmniCode: A Benchmark for Evaluating Software Engineering Agents
- SVRepair: Structured Visual Reasoning for Automated Program Repair
- Pull Requests as a Training Signal for Repo-Level Code Editing
- FeatureBench: Benchmarking Agentic Coding for Complex Feature Development
- Hybrid-Gym: Training Coding Agents to Generalize Across Tasks
- Narrowing the Complexity Gap in the Evaluation of Large Language Models
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
- MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences
- SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories
- Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories
- SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks
- SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent
- Multi-Agent Code Verification via Information Theory
- Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation
- More with Less: An Empirical Study of Turn-Control Strategies for Efficient Coding Agents
- When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?
- Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization
- Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
- A Benchmark for Localizing Code and Non-Code Issues in Software Projects
- BloomAPR: A Bloom's Taxonomy-based Framework for Assessing the Capabilities of LLM-Powered APR Solutions
- Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs
- The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
- How Safe Are AI-Generated Patches? A Large-scale Study on Security Risks in LLM and Agentic Automated Program Repair on SWE-bench
- SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
- SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution
- SWE-Exp: Experience-Driven Software Issue Resolution
- Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling
- The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering
- MCTS-Refined CoT: High-Quality Fine-Tuning Data for LLM-Based Repository Issue Resolution
- Unified Software Engineering Agent as AI Software Engineer
- SemAgent: A Semantics Aware Program Repair Agent
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
- Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
- Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks
- SWE-smith: Scaling Data for Software Engineering Agents
- Automated Benchmark Generation for Repository-Level Coding Tasks
- Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios
- PatchPilot: A Cost-Efficient Software Engineering Agent with Early Attempts on Formal Verification
- CodeMonkeys: Scaling Test-Time Compute for Software Engineering
- Large Language Model Critics for Execution-Free Evaluation of Code Changes
- Towards Exception Safety Code Generation with Intermediate Representation Agents Framework
- RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph
- BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
- MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
- SWE-Bench+: Enhanced Coding Benchmark for LLMs
- MarsCode Agent: AI-native Automated Bug Fixing
- SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
- Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
- CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases
- Exploring the Potential of Conversational Test Suite Based Program Repair on SWE-bench
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models
- Can Github issues be solved with Tree Of Thoughts?
- SpecRover: Code Intent Extraction via LLMs
- SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
- REDO: Execution-Free Runtime Error Detection for COding Agents
- A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models
- SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration