SWE-bench bash-only Leaderboard
SWE-bench bash-only: the agent operates through a bare shell (no scaffolding or tooling beyond bash), isolating raw model capability from agent-framework engineering. Score is the percentage of issues resolved. Metric: % Resolved (higher is better)
| # | Model | % Resolved | Paper |
|---|---|---|---|
| 1 | Claude 4.5 Opus (high reasoning) | 76.80 | link |
| 2 | Gemini 3 Flash (high reasoning) | 75.80 | link |
| 3 | MiniMax M2.5 (high reasoning) | 75.80 | link |
| 4 | Claude Opus 4.6 | 75.60 | link |
| 5 | Claude 4.5 Opus medium (20251101) | 74.40 | link |
| 6 | Gemini 3 Pro Preview (2025-11-18) | 74.20 | link |
| 7 | GLM-5 (high reasoning) | 72.80 | link |
| 8 | GPT 5.2 Codex | 72.80 | link |
| 9 | GPT-5-2 Codex | 72.80 | link |
| 10 | GPT-5-2 (high reasoning) | 72.80 | link |
| 11 | GPT-5.2 (2025-12-11) (high reasoning) | 71.80 | link |
| 12 | Claude 4.5 Sonnet (high reasoning) | 71.40 | link |
| 13 | Kimi K2.5 (high reasoning) | 70.80 | link |
| 14 | Claude 4.5 Sonnet (20250929) | 70.60 | link |
| 15 | DeepSeek V3.2 (high reasoning) | 70.00 | link |
| 16 | Gemini 3 Pro | 69.60 | link |
| 17 | GPT-5.2 (2025-12-11) | 69.00 | link |
| 18 | Claude 4 Opus (20250514) | 67.60 | link |
| 19 | Claude 4.5 Haiku (high reasoning) | 66.60 | link |
| 20 | GPT-5.1 (2025-11-13) (medium reasoning) | 66.00 | link |
| 21 | GPT-5.1-codex (medium reasoning) | 66.00 | link |
| 22 | GPT-5 (2025-08-07) (medium reasoning) | 65.00 | link |
| 23 | Claude 4 Sonnet (20250514) | 64.93 | link |
| 24 | Kimi K2 Thinking | 63.40 | link |
| 25 | MiniMax M2 | 61.00 | link |
| 26 | DeepSeek V3.2 Reasoner | 60.00 | - |
| 27 | GPT-5 mini (2025-08-07) (medium reasoning) | 59.80 | link |
| 28 | o3 (2025-04-16) | 58.40 | link |
| 29 | Devstral small (2512) | 56.40 | - |
| 30 | GPT-5 Mini | 56.20 | link |
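
The % Resolved column in the table above is a plain ratio: issues resolved divided by issues attempted, expressed as a percentage. The sketch below illustrates that arithmetic only. The function name `percent_resolved`, the count of 384 resolved issues, and the 500-issue total are assumptions for illustration and are not stated on this page.

```python
def percent_resolved(resolved: int, total: int) -> float:
    """Return resolved/total as a percentage, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return round(100.0 * resolved / total, 2)


# Hypothetical counts (assumed, not from the leaderboard):
# 384 issues resolved out of an assumed 500 attempted.
print(f"{percent_resolved(384, 500):.2f}")
```

Under these assumed counts, each additional resolved issue moves the score by 0.20 points, so small gaps between adjacent rows may correspond to only a handful of issues.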