BIG-Bench Hard (BBH) Leaderboard
23 challenging tasks on which prior LLMs underperformed humans. Metric: Accuracy (higher is better)
| # | Model | Accuracy | Paper |
|---|---|---|---|
| 1 | BenevolenceMessiah/Qwen2.5-72B-2x-Instruct-TIES-v1.0 | 61.91 | - |
| 2 | Baptiste-HUVELLE-10/LeTriomphant2.2_ECE_iLAB | 61.61 | - |
| 3 | EVA-UNIT-01/EVA-Qwen2.5-72B-v0.2 | 59.07 | - |
| 4 | Aryanne/QwentileSwap | 57.68 | - |
| 5 | CombinHorizon/zetasepic-abliteratedV2-Qwen2.5-32B-Inst-BaseMerge-TIES | 56.83 | - |
| 6 | CombinHorizon/huihui-ai-abliterated-Qwen2.5-32B-Inst-BaseMerge-TIES | 56.04 | - |
| 7 | Daemontatox/PathFinderAi3.0 | 55.54 | - |
| 8 | Daemontatox/CogitoZ | 53.89 | - |
| 9 | EpistemeAI/DeepThinkers-Phi4 | 53.79 | - |
| 10 | Danielbrdz/Barcenas-14b-phi-4 | 53.26 | - |
| 11 | Daemontatox/PathFinderAI2.0 | 52.96 | - |
| 12 | DoppelReflEx/MiniusLight-24B-v1c-test | 52.84 | - |
| 13 | BAAI/Infinity-Instruct-7M-Gen-Llama3_1-70B | 52.50 | - |
| 14 | 1024m/PHI-4-Hindi | 52.46 | - |
| 15 | DoppelReflEx/MiniusLight-24B-v1d-test | 52.36 | - |
| 16 | Daemontatox/PathfinderAI | 52.32 | - |
| 17 | Daemontatox/Llama3.3-70B-CogniLink | 52.12 | - |
| 18 | BAAI/Infinity-Instruct-3M-0625-Llama3-70B | 52.03 | - |
| 19 | Cran-May/merge_model_20250308_4 | 52.02 | - |
| 20 | BAAI/Infinity-Instruct-3M-0613-Llama3-70B | 51.33 | - |
| 21 | Danielbrdz/Barcenas-14b-Phi-3-medium-ORPO | 51.03 | - |
| 22 | CultriX/Qwen2.5-14B-MergeStock | 51.01 | - |
| 23 | Cran-May/merge_model_20250308_2 | 51.00 | - |
| 24 | CultriX/Qwen2.5-14B-MegaMerge-pt2 | 50.91 | - |
| 25 | CultriX/Qwen2.5-14B-ReasoningMerge | 50.87 | - |
| 26 | CultriX/SeQwence-14B-EvolMerge | 50.78 | - |
| 27 | CultriX/Qwen2.5-14B-Wernicke | 50.64 | - |
| 28 | DoppelReflEx/MiniusLight-24B-v1b-test | 50.64 | - |
| 29 | Cran-May/tempmotacilla-cinerea-0308 | 50.60 | - |