SWE-bench Multilingual Leaderboard
SWE-bench Multilingual extends the SWE-bench task formulation beyond Python to issues across multiple programming languages, testing how well a coding agent generalises across language ecosystems. The score is the percentage of issues resolved. Metric: % Resolved (higher is better)
| # | Model | % Resolved | Paper |
|---|---|---|---|
| 1 | Gemini 3 Flash | 72.70 | link |
| 2 | Claude 4.6 Opus | 72.00 | link |
| 3 | Claude 4.5 Opus | 70.70 | link |
| 4 | GLM-5 | 69.70 | link |
| 5 | Gemini 3 Pro | 68.70 | link |
| 6 | Minimax 2.5 | 68.30 | link |
| 7 | Kimi K2.5 | 67.30 | link |
| 8 | Claude 4.5 Sonnet | 67.00 | link |
| 9 | GPT-5.2 (high reasoning) | 66.70 | link |
| 10 | GPT 5.2 Codex | 66.30 | link |
| 11 | GPT-5-2 Codex | 66.30 | link |
| 12 | Claude 4.5 Haiku | 64.70 | link |
| 13 | DeepSeek V3.2 | 59.00 | link |
| 14 | GPT-5 mini | 39.70 | link |
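
As a rough illustration of the metric in the table, % Resolved can be computed from per-issue pass/fail outcomes. The sketch below is not the benchmark's official scoring harness: the function name `percent_resolved`, the issue identifiers, and the outcomes are hypothetical, and it assumes each issue simply counts as resolved or not resolved.

```python
def percent_resolved(outcomes: dict[str, bool]) -> float:
    """Return the percentage of issues marked resolved, rounded to 2 decimals.

    `outcomes` maps a (hypothetical) issue identifier to True if the agent's
    patch resolved the issue and False otherwise.
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    resolved = sum(outcomes.values())  # True counts as 1, False as 0
    return round(100.0 * resolved / len(outcomes), 2)


# Made-up outcomes for four issues; these are not real benchmark data.
example = {
    "issue-a": True,
    "issue-b": False,
    "issue-c": True,
    "issue-d": True,
}
print(percent_resolved(example))  # 75.0
```

Because every issue carries equal weight under this assumption, the score does not distinguish between languages or repositories; a per-language breakdown would need the outcomes grouped before averaging.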