OSWorld (Verified) Leaderboard

OSWorld success rate on the ARC-Prize-style Verified set: 361 real computer-use tasks spanning Chrome, GIMP, LibreOffice, VS Code, the operating system, and more. Only verified, reproducible runs count. Self-reported numbers are excluded, so these results reflect the reproducible state of the art rather than inflated self-reported figures. Success Rate is the percentage of tasks the agent completes end-to-end, and each model is listed at its best verified max-steps configuration.
Metric: Success Rate (%, higher is better)
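The sketch below illustrates the metric described above. It computes a success rate as the mean per-task score expressed as a percentage, then keeps a model's best result across max-steps configurations. It is not the leaderboard's actual scoring pipeline. The configuration labels and toy scores are hypothetical. The code also assumes that each task score lies in [0, 1], with fractional values coming from partial credit or averaged repeated runs.

```python
from statistics import mean

# Assumption: one score per task in [0, 1]; 1.0 means completed end-to-end.
# Fractional scores (partial credit or averaged runs) are a hypothetical allowance.
TaskScores = list[float]


def success_rate(scores: TaskScores) -> float:
    """Return the success rate as a percentage of the evaluated tasks."""
    if not scores:
        raise ValueError("no task scores")
    return 100.0 * mean(scores)


def best_config(runs: dict[str, TaskScores]) -> tuple[str, float]:
    """Pick the max-steps configuration with the highest success rate.

    `runs` maps a hypothetical configuration label (e.g. "50 steps")
    to the per-task scores obtained under that configuration.
    """
    rates = {cfg: success_rate(scores) for cfg, scores in runs.items()}
    cfg = max(rates, key=rates.get)
    return cfg, rates[cfg]


if __name__ == "__main__":
    # Toy data for one hypothetical model on 4 tasks (the real set has 361).
    runs = {
        "15 steps": [1.0, 0.0, 0.0, 1.0],
        "50 steps": [1.0, 1.0, 0.0, 1.0],
        "100 steps": [1.0, 1.0, 0.0, 0.5],
    }
    print(best_config(runs))  # ('50 steps', 75.0)
```

Reporting the best configuration per model, as the leaderboard does, means a longer step budget only matters if it actually raises the success rate.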


#	Model	Success Rate (%)	Paper
1	Pointer Agent w/ Opus 4.7	83.64	link
2	Holo3-35B-A3B	82.56	link
3	Pointer Agent w/ Sonnet 4.6	81.45	link
4	OpenAPA w/ gemini-3.1-pro	78.34	link
5	VLAA-GUI w/ Opus 4.5	76.26	link
6	MiniMax M3	75.19	link
7	HIPPO Agent w/ Opus 4.5	74.48	link
8	Qwen 3.7 Plus	73.30	link
9	Kimi K2.6	73.06	link
10	agent s3 w/ Opus 4.5 + GPT-5 bBoN (N=10)	72.58	link
11	claude-sonnet-4-6	72.11	link
12	agent s3 w/ GPT-5 bBoN (N=10)	69.90	link
13	agent s3 w/ Opus 4.5 bBoN (N=1)	67.46	link
14	UiPath Screen Agent w/ Opus 4.5	67.14	link
15	OS-Symphony w/ GPT-5	65.77	link
16	agent s3 w/ GPT-5 bBoN (N=1)	65.58	link
17	GBOX Agent	64.22	link
18	GTA1 w/ GPT-5	63.41	link
19	Kimi K2.5	63.30	link
20	claude-sonnet-4-5-20250929	62.88	link
21	Agentic-Lybic-Maestro	61.93	link
22	Seed-1.8	61.87	link
23	CoACT-1	60.76	link
24	aworldGUIAgent-v1	58.04	link
25	EvoCUA-20260105	56.73	link
26	agent s2.5 w/ o3	56.00	link
27	GUI-Owl-1.5 32B	55.44	link
28	DeepMiner-Mano-72B	53.91	link
29	UiPath Screen Agent w/ GPT-5	53.63	link
30	GTA1 w/ o3	53.10	link
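The ranking in the table above is a descending sort on Success Rate. The sketch below reproduces that ordering for a few rows copied from the table and computes each entry's gap to the leader in percentage points. The `(model, rate)` tuple layout and the `rank` helper are illustrative choices, not part of the leaderboard.

```python
# A few (model, success rate %) rows copied from the table, deliberately unordered.
ROWS: list[tuple[str, float]] = [
    ("Holo3-35B-A3B", 82.56),
    ("Pointer Agent w/ Opus 4.7", 83.64),
    ("OpenAPA w/ gemini-3.1-pro", 78.34),
    ("Pointer Agent w/ Sonnet 4.6", 81.45),
]


def rank(rows: list[tuple[str, float]]) -> list[tuple[int, str, float, float]]:
    """Sort by success rate (highest first) and attach a 1-based rank
    plus the gap to the top entry, rounded to two decimals."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    top = ordered[0][1]
    return [
        (i, model, rate, round(top - rate, 2))
        for i, (model, rate) in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for entry in rank(ROWS):
        print(entry)
```

Rounding the gap to two decimals matches the table's precision and hides floating-point noise from the subtraction.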