SWE-bench
Canonical
43 papers using it
First seen: 2024
'SWE-bench' is a dataset containing 64,380 runs from 126 software engineering agent configurations across 43 frameworks, used to evaluate the behavioral differences and performance outcomes of various LLM-based software engineering agents.
Papers using SWE-bench (43)
- SWE-Explore: Benchmarking How Coding Agents Explore Repositories
- Swe-agent: Agent-computer Interfaces Enable Automated Software Engineering
- Qwen3-Coder-Next Technical Report
- Orchard: An Open-Source Agentic Modeling Framework
- Agentcgroup: Understanding And Controlling OS Resources Of AI Agents
- Agentic Software: How AI Agents Are Restructuring the Software Paradigm
- The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary
- ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer
- Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment
- Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents
- Green SARC: Predictive Cost and Carbon Governance for Agentic AI Systems
- TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
- SpecBench: Evaluating Specification-Level Reasoning for Software Engineering LLM Agents
- AdaptOrch: Task-Adaptive Multi-Agent Orchestration in the Era of LLM Performance Convergence
- Adarubric: Task-adaptive Rubrics For LLM Agent Evaluation
- Hybrid-gym: Training Coding Agents To Generalize Across Tasks
- Infantagent-next: A Multimodal Generalist Agent For Automated Computer Interaction
- Calibrating Conservatism for Scalable Oversight
- Patchpilot: A Cost-efficient Software Engineering Agent With Early Attempts On Formal Verification
- Swe-bench Goes Live!
- Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults
- OmniCode: A Benchmark for Evaluating Software Engineering Agents
- Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents
- How Do AI Agents Spend Your Money? Analyzing And Predicting Token Consumption In Agentic Coding Tasks
- Swe-protégé: Learning To Selectively Collaborate With An Expert Unlocks Small Language Models As Software Engineering Agents
- From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python
- React-ing to Grace Hopper 200: Five Open-Weights Coding Models, One React Native App, One GH200, One Weekend
- SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent
- AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration
- EvoMAS: Evolutionary Generation of Multi-Agent Systems
- AgentSpawn: Adaptive Multi-Agent Collaboration Through Dynamic Spawning for Long-Horizon Code Generation
- Pull Requests as a Training Signal for Repo-Level Code Editing
- Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
- SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair
- SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent
- Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study
- Swe-rebench: An Automated Pipeline For Task Collection And Decontaminated Evaluation Of Software Engineering Agents
- When Agents Go Astray: Course-correcting SWE Agents With Prms
- Breakpoint: Scalable Evaluation Of System-level Reasoning In LLM Code Agents
- Putting It All Into Context: Simplifying Agents With Lclms
- Kimi-dev: Agentless Training As Skill Prior For Swe-agents
- Hyperagent: Generalist Software Engineering Agents To Solve Coding Tasks At Scale
- REDO: Execution-free Runtime Error Detection For Coding Agents