ARC-AGI-2 Leaderboard
ARC-AGI-2 (Semi-Private Eval): the second-generation ARC-AGI benchmark, substantially harder than v1. Scores are on the withheld semi-private evaluation set (not the public eval set), so results are not contaminated by public-set overfitting. The Human Panel baseline is ~100%. Each model is shown at its best score across compute budgets; Human and Kaggle competition entries are excluded. Accuracy is the percentage of the 120 semi-private evaluation tasks solved. · Metric: Accuracy (higher is better) · Updated 23h ago
| # | Model | Accuracy (%) | Paper |
|---|---|---|---|
| 1 | GPT-5.5 (xHigh) | 85.00 | link |
| 2 | Gemini 3 Deep Think (2/26) | 84.58 | link |
| 3 | GPT-5.5 Pro (High) | 84.58 | link |
| 4 | GPT-5.4 Pro (xHigh) | 83.33 | link |
| 5 | Gemini 3.1 Pro (Preview) | 77.08 | link |
| 6 | Claude 4.7 (Max) | 75.83 | link |
| 7 | GPT-5.4 (xHigh) | 73.95 | link |
| 8 | GPT-5.2 (Refine.) | 72.90 | link |
| 9 | Claude Opus 4.8 (High) | 72.08 | link |
| 10 | Gemini 3.5 Flash (High) | 72.08 | link |
| 11 | Claude Opus 4.6 (120K, High) | 69.17 | link |
| 12 | Grok 4.20 (Reasoning) | 65.14 | link |
| 13 | Claude Sonnet 4.6 (High) | 60.42 | link |
| 14 | GPT-5.2 Pro (High) | 54.16 | link |
| 15 | Gemini 3 Pro (Refine.) | 54.00 | link |
| 16 | GPT-5.2 (xHigh) | 52.91 | link |
| 17 | Gemini 3 Deep Think (Preview) ² | 45.14 | link |
| 18 | Claude Opus 4.5 (Thinking, 64K) | 37.64 | link |
| 19 | Gemini 3 Flash Preview (High) | 33.61 | link |
| 20 | Gemini 3 Pro | 31.11 | link |
| 21 | Grok 4 (Refine.) | 29.44 | link |
| 22 | GLM-5.2 | 22.78 | link |
| 23 | GPT-5.4 Mini (xHigh) | 18.90 | link |
| 24 | GPT-5 Pro | 18.33 | link |
| 25 | GPT-5.1 (Thinking, High) | 17.64 | link |
| 26 | Grok 4 (Thinking) | 15.97 | link |
| 27 | Claude Sonnet 4.5 (Thinking 32K) | 13.61 | link |
| 28 | Kimi K2.5 | 11.81 | link |
| 29 | GPT-5 (High) | 9.86 | link |
| 30 | Claude Opus 4 (Thinking 16K) | 8.61 | link |
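
The selection rule in the description (best score per model across compute budgets, with Human and Kaggle entries excluded) can be sketched roughly as follows. This is a minimal illustration and not the leaderboard's actual pipeline: the `Entry` record, its field names, the `kind` categories, and the sample rows are all hypothetical placeholders, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str       # hypothetical model name
    budget: str      # hypothetical compute-budget label
    kind: str        # assumed categories: "model", "human", "kaggle"
    accuracy: float  # percent of tasks solved


# Assumed labels for the entry types the description says are excluded.
EXCLUDED_KINDS = {"human", "kaggle"}


def build_leaderboard(entries):
    """Keep each model's best-scoring entry, then sort by accuracy, descending."""
    best = {}
    for e in entries:
        if e.kind in EXCLUDED_KINDS:
            continue  # skip Human and Kaggle competition entries
        if e.model not in best or e.accuracy > best[e.model].accuracy:
            best[e.model] = e
    return sorted(best.values(), key=lambda e: e.accuracy, reverse=True)


# Placeholder data for illustration only.
entries = [
    Entry("model-a", "low", "model", 40.0),
    Entry("model-a", "high", "model", 55.0),
    Entry("model-b", "high", "model", 50.0),
    Entry("human-panel", "n/a", "human", 100.0),
    Entry("kaggle-team", "n/a", "kaggle", 30.0),
]

rows = build_leaderboard(entries)
for rank, e in enumerate(rows, start=1):
    print(f"| {rank} | {e.model} ({e.budget}) | {e.accuracy:.2f} |")
```

Python's sort is stable, so tied scores keep their input order. The table above lists tied entries at consecutive ranks, but how the real leaderboard breaks ties is not stated in the source.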