SWE-bench

Canonical

43 papers using it

2024 first seen

SWE-bench is a dataset containing 64,380 runs from 126 software engineering agent configurations across 43 frameworks, used to evaluate behavioral differences and performance outcomes among LLM-based software engineering agents.


Papers using SWE-bench (43)

SWE-Explore: Benchmarking How Coding Agents Explore Repositories · 2026

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering · 2024 · 27 cites

Qwen3-Coder-Next Technical Report · 2026

Orchard: An Open-Source Agentic Modeling Framework · 2026

AgentCgroup: Understanding And Controlling OS Resources Of AI Agents · 2026

Agentic Software: How AI Agents Are Restructuring the Software Paradigm · 2026

The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary · 2026

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer · 2026

Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment · 2026

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents · 2026

Green SARC: Predictive Cost and Carbon Governance for Agentic AI Systems · 2026

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing · 2026

SpecBench: Evaluating Specification-Level Reasoning for Software Engineering LLM Agents · 2026

AdaptOrch: Task-Adaptive Multi-Agent Orchestration in the Era of LLM Performance Convergence · 2026

AdaRubric: Task-adaptive Rubrics For LLM Agent Evaluation · 2026

Hybrid-Gym: Training Coding Agents To Generalize Across Tasks · 2026

InfantAgent-Next: A Multimodal Generalist Agent For Automated Computer Interaction · 2025

Calibrating Conservatism for Scalable Oversight · 2026

PatchPilot: A Cost-efficient Software Engineering Agent With Early Attempts On Formal Verification · 2025

SWE-bench Goes Live! · 2025 · 1 cite

Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults · 2025 · 1 cite

OmniCode: A Benchmark for Evaluating Software Engineering Agents · 2026

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents · 2026

How Do AI Agents Spend Your Money? Analyzing And Predicting Token Consumption In Agentic Coding Tasks · 2026

SWE-Protégé: Learning To Selectively Collaborate With An Expert Unlocks Small Language Models As Software Engineering Agents · 2026

From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python · 2026

React-ing to Grace Hopper 200: Five Open-Weights Coding Models, One React Native App, One GH200, One Weekend · 2026

SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent · 2026

AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration · 2026

EvoMAS: Evolutionary Generation of Multi-Agent Systems · 2026

AgentSpawn: Adaptive Multi-Agent Collaboration Through Dynamic Spawning for Long-Horizon Code Generation · 2026

Pull Requests as a Training Signal for Repo-Level Code Editing · 2026

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward · 2026

SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair · 2026

SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent · 2025

Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study · 2025

SWE-rebench: An Automated Pipeline For Task Collection And Decontaminated Evaluation Of Software Engineering Agents · 2025

When Agents Go Astray: Course-correcting SWE Agents With PRMs · 2025

Breakpoint: Scalable Evaluation Of System-level Reasoning In LLM Code Agents · 2025

Putting It All Into Context: Simplifying Agents With LCLMs · 2025

Kimi-Dev: Agentless Training As Skill Prior For SWE-Agents · 2025

HyperAgent: Generalist Software Engineering Agents To Solve Coding Tasks At Scale · 2024 · 1 cite

REDO: Execution-free Runtime Error Detection For Coding Agents · 2024