
SWE-bench Verified

Canonical

45 papers using it

2025 first seen

A human-validated subset of SWE-bench with confirmed-solvable, well-specified issues.


Papers using SWE-bench Verified (45)

Function-Aware Fill-in-the-Middle as Mid-Training for Coding Agent Foundation Models · 2026

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry · 2026

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills · 2026

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness · 2026

Single-Rollout Asynchronous Optimization for Agentic Reinforcement Learning · 2026

LLM-as-a-Verifier: A General-Purpose Verification Framework · 2026

Decentralized Multi-Agent Systems with Shared Context · 2026

SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context · 2026

From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws · 2026

Don't Blame the Large Language Model: How Agent Harness Evolution Shapes Coding Agent Quality · 2026

CompactionRL: Reinforcement Learning with Context Compaction for Long-Horizon Agents · 2026

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents · 2026

Open-SWE-Traces: Advancing Dual-Mode Multilingual Distillation for Software Engineering Agents · 2026

Long Live the Librarian! A Persistent Search Sub-Agent for Energy-Efficient Multi-Agent Software Engineering Systems · 2026

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses · 2026 · 1 citation

Hybrid-gym: Training Coding Agents To Generalize Across Tasks · 2026

What Context Does a Coding Agent Actually Need to Act? · 2026

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory · 2026

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages · 2026

SHERLOC: Structured Diagnostic Localization for Code Repair Agents · 2026

ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents · 2026

Automated Benchmark Auditing for AI Agents and Large Language Models · 2026

CoMem: Context Management with A Decoupled Long-Context Model · 2026

SWE-Bench-CL: Continual Learning For Coding Agents · 2025

SWE-Protégé: Learning To Selectively Collaborate With An Expert Unlocks Small Language Models As Software Engineering Agents · 2026

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing · 2026

Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents · 2026

Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents · 2026

SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent · 2026

Evaluating Plan Compliance In Autonomous Programming Agents · 2026

Coherence Collapse: Diagnosing Why Code Agents Fail After Reaching the Right Code · 2026

Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents · 2026

SWE-Universe: Scale Real-World Verifiable Environments to Millions · 2026

SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training · 2026

EvoMAS: Evolutionary Generation of Multi-Agent Systems · 2026

Group-evolving Agents: Open-ended Self-improvement Via Experience Sharing · 2026

Learning Adaptive Parallel Execution for Efficient Code Localization · 2026

Toward Training Superintelligent Software Agents through Self-Play SWE-RL · 2025

SWE-EVO: Benchmarking Coding Agents In Long-horizon Software Evolution Scenarios · 2025

Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement · 2025

R2E-Gym: Procedural Environments And Hybrid Verifiers For Scaling Open-weights SWE Agents · 2025

Putting It All Into Context: Simplifying Agents With LCLMs · 2025

A Self-improving Coding Agent · 2025

Establishing Best Practices For Building Rigorous Agentic Benchmarks · 2025

Guided Search Strategies In Non-serializable Environments With Applications To Software Engineering Agents · 2025
