HallusionBench Leaderboard
Diagnostic benchmark for visual illusion and knowledge hallucination in vision-language models. It probes whether a model over-relies on language priors instead of genuinely reading the image. Metric: overall question-pair accuracy (higher is better).
| # | Model | Accuracy (%) | Paper |
|---|---|---|---|
| 1 | SenseNova-V6-Pro | 67.10 | link |
| 2 | SenseNova-V6-5-Pro | 66.70 | link |
| 3 | GPT-5-20250807 | 65.20 | link |
| 4 | JT-VL-Chat-V3.0 | 64.40 | link |
| 5 | Gemini-2.5-Pro | 64.10 | link |
| 6 | MiMo-VL-7B | 63.80 | link |
| 7 | CongRong-v2.0 | 63.20 | link |
| 8 | BlueLM-2.6-3B | 63.10 | link |
| 9 | GPT-5-mini-20250807 | 62.50 | link |
| 10 | GPT-5-nano-20250807 | 60.90 | link |
| 11 | TeleMM | 60.60 | link |
| 12 | BailingMM-Lite-1203 | 60.10 | link |
| 13 | BlueLM-2.5-3B | 60.00 | link |
| 14 | GPT-4.5 | 60.00 | link |
| 15 | R-4B | 60.00 | link |
| 16 | Kimi-VL-A3B-Thinking-2506 | 59.80 | link |
| 17 | InternVL2.5-38B-MPO | 59.70 | link |
| 18 | BailingMM-Pro-0120 | 59.40 | link |
| 19 | Qwen-VL-Max-0809 | 59.20 | link |
| 20 | InternVL3-78B | 59.10 | link |
| 21 | Ovis2-34B | 58.80 | link |
| 22 | Qwen2-VL-72B | 58.70 | link |
| 23 | GLM-4v-Plus-20250111 | 58.50 | link |
| 24 | GPT-4.1-20250414 | 58.50 | link |
| 25 | InternVL3-38B | 58.40 | link |
| 26 | Qwen2.5-VL-32B | 58.40 | link |
| 27 | InternVL2.5-78B-MPO | 58.10 | link |
| 28 | Gemini-2.0-Flash | 58.00 | link |
| 29 | InternVL2.5-38B | 57.90 | link |
| 30 | HunYuan-Standard-Vision | 57.70 | link |
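
The description above names the metric as question-pair accuracy without spelling out how it is scored. As a minimal sketch, assuming a pair counts as correct only when every question in it is answered correctly, the computation could look like the following. The record fields `pair_id`, `prediction`, and `answer` and the function `pair_accuracy` are hypothetical names chosen for illustration; the benchmark's official scoring script may differ.

```python
from collections import defaultdict


def pair_accuracy(records):
    """Return the percentage of question pairs answered fully correctly.

    Assumption (not confirmed by the leaderboard): a pair is correct only
    if every question sharing its pair_id is answered correctly.
    """
    # Group per-question correctness flags by their (hypothetical) pair_id.
    pairs = defaultdict(list)
    for rec in records:
        pairs[rec["pair_id"]].append(rec["prediction"] == rec["answer"])

    if not pairs:
        return 0.0

    # A pair scores 1 only when all of its questions are correct.
    correct_pairs = sum(all(flags) for flags in pairs.values())
    return 100.0 * correct_pairs / len(pairs)


# Illustrative records only; these are not benchmark data.
records = [
    {"pair_id": "p1", "prediction": "yes", "answer": "yes"},
    {"pair_id": "p1", "prediction": "no", "answer": "no"},
    {"pair_id": "p2", "prediction": "yes", "answer": "yes"},
    {"pair_id": "p2", "prediction": "yes", "answer": "no"},
]

score = pair_accuracy(records)
print(f"{score:.2f}")
```

Under this assumption, a model that answers from language priors alone is penalized: getting one question of a pair right while missing its counterpart earns no credit for that pair.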