OSWorld (Verified) Leaderboard
OSWorld success rate on the ARC-Prize-style Verified set (361 real computer-use tasks across Chrome, GIMP, LibreOffice, VS Code, the OS, and more). Only verified, reproducible runs count; self-reported numbers are excluded, so the ranking reflects the verified state of the art rather than inflated self-reports. Success Rate is the percentage of tasks the agent completes end-to-end; each model is shown at its best verified max-steps configuration. Metric: Success Rate (higher is better).
| # | Model | Success Rate (%) | Paper |
|---|---|---|---|
| 1 | Pointer Agent w/ Opus 4.7 | 83.64 | link |
| 2 | Holo3-35B-A3B | 82.56 | link |
| 3 | Pointer Agent w/ Sonnet 4.6 | 81.45 | link |
| 4 | OpenAPA w/ gemini-3.1-pro | 78.34 | link |
| 5 | VLAA-GUI w/ Opus 4.5 | 76.26 | link |
| 6 | MiniMax M3 | 75.19 | link |
| 7 | HIPPO Agent w/ Opus 4.5 | 74.48 | link |
| 8 | Qwen 3.7 Plus | 73.30 | link |
| 9 | Kimi K2.6 | 73.06 | link |
| 10 | agent s3 w/ Opus 4.5 + GPT-5 bBoN (N=10) | 72.58 | link |
| 11 | claude-sonnet-4-6 | 72.11 | link |
| 12 | agent s3 w/ GPT-5 bBoN (N=10) | 69.90 | link |
| 13 | agent s3 w/ Opus 4.5 bBoN (N=1) | 67.46 | link |
| 14 | UiPath Screen Agent w/ Opus 4.5 | 67.14 | link |
| 15 | OS-Symphony w/ GPT-5 | 65.77 | link |
| 16 | agent s3 w/ GPT-5 bBoN (N=1) | 65.58 | link |
| 17 | GBOX Agent | 64.22 | link |
| 18 | GTA1 w/ GPT-5 | 63.41 | link |
| 19 | Kimi K2.5 | 63.30 | link |
| 20 | claude-sonnet-4-5-20250929 | 62.88 | link |
| 21 | Agentic-Lybic-Maestro | 61.93 | link |
| 22 | Seed-1.8 | 61.87 | link |
| 23 | CoACT-1 | 60.76 | link |
| 24 | aworldGUIAgent-v1 | 58.04 | link |
| 25 | EvoCUA-20260105 | 56.73 | link |
| 26 | agent s2.5 w/ o3 | 56.00 | link |
| 27 | GUI-Owl-1.5 32B | 55.44 | link |
| 28 | DeepMiner-Mano-72B | 53.91 | link |
| 29 | UiPath Screen Agent w/ GPT-5 | 53.63 | link |
| 30 | GTA1 w/ o3 | 53.10 | link |
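
The ranking rule described above (success rate as the percentage of tasks completed end-to-end, with each model listed at its best max-steps configuration) can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual scoring code: the `Run` record, the `success_rate` and `leaderboard` helpers, the model labels, and all numbers are hypothetical, and it assumes each task is scored simply as pass or fail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One evaluation run (hypothetical record, not the leaderboard's schema)."""
    model: str       # placeholder model label
    max_steps: int   # step budget used for this run
    passed: int      # tasks completed end-to-end (assumes pass/fail scoring)
    total: int       # tasks attempted


def success_rate(run: Run) -> float:
    """Percentage of tasks completed end-to-end."""
    return 100.0 * run.passed / run.total


def leaderboard(runs: list[Run]) -> list[tuple[str, float]]:
    """Keep each model's best configuration, then sort by rate, highest first."""
    best: dict[str, float] = {}
    for run in runs:
        rate = success_rate(run)
        if rate > best.get(run.model, -1.0):
            best[run.model] = rate
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers for illustration only; not taken from the table above.
runs = [
    Run("model-a", max_steps=50, passed=180, total=361),
    Run("model-a", max_steps=100, passed=200, total=361),
    Run("model-b", max_steps=100, passed=190, total=361),
]

for rank, (model, rate) in enumerate(leaderboard(runs), start=1):
    print(f"| {rank} | {model} | {rate:.2f} |")
```

Under these assumptions, `model-a` is ranked by its 100-step run rather than its 50-step run, mirroring the "best verified max-steps configuration" rule in the description.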