ARC-AGI-1 Leaderboard
ARC-AGI-1 (Semi-Private Eval): the original Abstract Reasoning Corpus benchmark by François Chollet. 400 novel visual reasoning tasks designed so that memorisation cannot substitute for genuine generalisation. The Human Panel baseline is ~98%. Scores are on the withheld semi-private evaluation set (not the public eval set). Each model is shown at its best score across compute budgets; Human and Kaggle competition entries are excluded. Metric: Accuracy, the % of tasks solved (higher is better).
| # | Model | Accuracy (%) | Paper |
|---|---|---|---|
| 1 | Gemini 3.1 Pro (Preview) | 98.00 | link |
| 2 | GPT-5.5 Pro (High) | 96.50 | link |
| 3 | Gemini 3 Deep Think (2/26) | 96.00 | link |
| 4 | GPT-5.5 (xHigh) | 95.00 | link |
| 5 | GPT-5.2 (Refine.) | 94.50 | link |
| 6 | GPT-5.4 Pro (xHigh) | 94.50 | link |
| 7 | Claude Opus 4.6 (120K, High) | 94.00 | link |
| 8 | GPT-5.4 (xHigh) | 93.67 | link |
| 9 | Claude 4.7 (High) | 93.50 | link |
| 10 | Claude Opus 4.8 (Max) | 92.50 | link |
| 11 | Gemini 3.5 Flash (High) | 92.50 | link |
| 12 | GPT-5.2 Pro (xHigh) | 90.50 | link |
| 13 | Grok 4.20 (Reasoning) | 89.50 | link |
| 14 | Gemini 3 Deep Think (Preview) ² | 87.50 | link |
| 15 | Claude Sonnet 4.6 (High) | 86.50 | link |
| 16 | GPT-5.2 (xHigh) | 86.17 | link |
| 17 | Gemini 3 Flash Preview (High) | 84.67 | link |
| 18 | Opus 4.5 (Thinking, 64K) | 80.00 | link |
| 19 | Grok 4 (Refine.) | 79.60 | link |
| 20 | GLM-5.2 | 77.00 | link |
| 21 | Gemini 3 Pro | 75.00 | link |
| 22 | GPT-5.1 (Thinking, High) | 72.83 | link |
| 23 | GPT-5 Pro | 70.17 | link |
| 24 | Grok 4 (Thinking) | 66.67 | link |
| 25 | GPT-5 (High) | 65.67 | link |
| 26 | Kimi K2.5 | 65.33 | link |
| 27 | Claude Sonnet 4.5 (Thinking 32K) | 63.67 | link |
| 28 | GPT-5.4 Mini (xHigh) | 63.67 | link |
| 29 | Minimax M2.5 | 63.67 | link |
| 30 | o3 (High) | 60.83 | link |
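
The selection rule stated in the description (each model at its best score across compute budgets, with Human and Kaggle entries excluded, ranked by accuracy) can be sketched roughly as below. This is only an illustration, not the leaderboard's actual pipeline: the `Run` record, its field names, the category labels, and the sample rows are all hypothetical, and breaking ties alphabetically by model name is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One evaluation run. All field names here are hypothetical."""
    model: str       # display name of the model
    budget: str      # compute budget label, e.g. "low" or "high"
    category: str    # "model", "human", or "kaggle" (assumed labels)
    accuracy: float  # % of tasks solved


# Entry categories left off the leaderboard, per the description.
EXCLUDED = {"human", "kaggle"}


def accuracy_pct(solved: float, total: int) -> float:
    """Accuracy as the percentage of tasks solved, rounded to 2 decimals."""
    return round(100.0 * solved / total, 2)


def build_leaderboard(runs):
    """Keep each model's best run, drop excluded categories, rank descending."""
    best = {}
    for run in runs:
        if run.category in EXCLUDED:
            continue
        current = best.get(run.model)
        if current is None or run.accuracy > current.accuracy:
            best[run.model] = run
    # Higher accuracy first; ties broken by model name (an assumption).
    return sorted(best.values(), key=lambda r: (-r.accuracy, r.model))


# Made-up sample data, not real leaderboard results.
SAMPLE_RUNS = [
    Run("model-a", "low", "model", 70.0),
    Run("model-a", "high", "model", 85.5),
    Run("model-b", "high", "model", 90.0),
    Run("human-panel", "n/a", "human", 98.0),
    Run("kaggle-team", "n/a", "kaggle", 55.0),
]

if __name__ == "__main__":
    for rank, run in enumerate(build_leaderboard(SAMPLE_RUNS), start=1):
        print(f"| {rank} | {run.model} ({run.budget}) | {run.accuracy:.2f} |")
```

Keeping one run per model in a dictionary means a model evaluated at several compute budgets appears only once, at its highest score, which matches the "best score across compute budgets" wording.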