GAIA Benchmark (2023) Leaderboard
GAIA is a 466-question benchmark for general AI assistants, covering multi-step reasoning, multi-modality, web and tool use, and file handling. Its questions are real, human-level tasks aimed at testing assistant-style agents end-to-end. Score is the share of questions an agent solves correctly across all 3 difficulty levels, on a scale from 0 to 1. · Metric: Score (higher is better)
| # | Model | Score | Paper |
|---|---|---|---|
| 1 | Agent_v0.1.4 | 0.83 | - |
| 2 | Skywork Deep Research Agent v2 | 0.83 | - |
| 3 | Agent_v0.1.3 | 0.82 | - |
| 4 | AWorld (Run Instantly) | 0.82 | - |
| 5 | Agent_v0.1.2 | 0.81 | - |
| 6 | Agent_v0.1.1 | 0.80 | - |
| 7 | h2oGPTe Agent v1.6.33 | 0.80 | - |
| 8 | Su Zero Ultra | 0.80 | - |
| 9 | Agent2030-v2.3 | 0.79 | - |
| 10 | Agent_v0.1.0 | 0.79 | - |
| 11 | h2oGPTe Agent v1.6.32 | 0.79 | - |
| 12 | desearch | 0.78 | - |
| 13 | AWorld (Run Instantly) | 0.77 | - |
| 14 | Agent2030-v2.2 | 0.76 | - |
| 15 | SU AI Zero | 0.76 | - |
| 16 | Agent_v0.0.9 | 0.75 | - |
| 17 | Alita | 0.75 | - |
| 18 | h2oGPTe Agent v1.6.27 (March 17 original date) | 0.75 | - |
| 19 | Agent2030-v2.1 | 0.74 | - |
| 20 | Agent_v0.0.8 | 0.73 | - |
| 21 | AgentZ_v0.10 | 0.73 | - |
| 22 | Langfun Agent v2.3 | 0.73 | - |
| 23 | Agent2030-v2.0 | 0.72 | - |
| 24 | agent 90000 | 0.72 | - |
| 25 | agent-pro | 0.72 | - |
| 26 | agent zero v1.2 | 0.72 | - |
| 27 | AWorld (Run Instantly) | 0.72 | - |
| 28 | Langfun Agent v2.2 | 0.72 | - |
| 29 | agent333 | 0.71 | - |
| 30 | agent zero v1.1 | 0.71 | - |
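
The Score metric described above (share of questions solved correctly across all 3 difficulty levels) can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual scoring code: the function names `gaia_score` and `per_level_scores`, the `(level, is_correct)` input format, and the toy data are all assumptions made for the example, and how an answer is judged correct is taken as given.

```python
from collections import Counter


def gaia_score(results):
    """Overall score: fraction of questions answered correctly (0 to 1).

    `results` is an iterable of (level, is_correct) pairs, where level is
    1, 2, or 3. This input format is an assumption for illustration.
    """
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    return sum(1 for _, ok in results if ok) / len(results)


def per_level_scores(results):
    """Fraction of questions answered correctly within each difficulty level."""
    totals, solved = Counter(), Counter()
    for level, ok in results:
        totals[level] += 1
        solved[level] += bool(ok)
    return {level: solved[level] / totals[level] for level in sorted(totals)}


# Hypothetical toy run with 5 questions; not real GAIA data.
toy = [(1, True), (1, True), (2, True), (2, False), (3, False)]
print(gaia_score(toy))        # 0.6
print(per_level_scores(toy))  # {1: 1.0, 2: 0.5, 3: 0.0}
```

Because the overall score pools every question, levels with more questions weigh more heavily in it than a plain average of the three per-level scores would suggest.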