CyberMetric (10K) Leaderboard
CyberMetric LLM cybersecurity-knowledge benchmark: accuracy on the largest (10,000-question) split, RAG-generated from NIST standards, research papers and security textbooks. Sourced from the FrontierAI Cybersecurity leaderboard's aggregation of published model results. · Metric: Accuracy (higher is better)
| # | Model | Accuracy (%) | Paper |
|---|---|---|---|
| 1 | GPT-4o | 88.89 | - |
| 2 | GPT-4-Turbo | 88.50 | - |
| 3 | Gemini Pro 1.0 | 87.50 | - |
| 4 | Falcon-180B-Chat | 87.00 | - |
| 5 | Mixtral-8x7B-Instruct | 87.00 | - |
| 6 | GPT-3.5-Turbo | 80.30 | - |
| 7 | Mistral-7B-Instruct-v0.2 | 74.82 | - |
| 8 | Gemma-1.1-7B | 73.32 | - |
| 9 | Llama-3-8B-Instruct | 71.25 | - |
| 10 | Flan-T5-XXL | 67.50 | - |
| 11 | Llama 2-70B | 66.10 | - |
| 12 | Zephyr-7B-beta | 65.00 | - |
| 13 | Qwen1.5-MoE-A2.7B | 60.73 | - |
| 14 | Qwen1.5-7B | 59.79 | - |
| 15 | Qwen-7B | 54.09 | - |
| 16 | Phi-2 | 52.13 | - |
| 17 | DeciLM-7B | 50.75 | - |
| 18 | Llama3-ChatQA-1.5-8B | 49.64 | - |
| 19 | Qwen1.5-4B | 40.29 | - |
| 20 | Genstruct-7B | 36.93 | - |
| 21 | Llama-3-8B | 36.00 | - |
| 22 | Gemma-7B | 34.28 | - |
| 23 | Dolly V2 12b BF16 | 27.00 | - |
| 24 | Gemma-2B | 19.18 | - |
| 25 | Phi-3-mini-4k-Instruct | 4.80 | - |
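
For readers who want to see how an accuracy figure like those in the table could be computed, here is a minimal sketch. It assumes multiple-choice questions scored by exact match of the chosen option letter against an answer key. The function name `accuracy` and the toy data are hypothetical, and the scoring scripts behind the published results may differ (for example in how they parse model output).

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that match the answer key.

    Assumption: each item is a single option letter such as "A" or "b";
    comparison ignores case and surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data: four questions, three answered correctly.
answers = ["A", "C", "B", "D"]
predictions = ["A", "C", "D", "D"]
score = accuracy(predictions, answers)
print(f"{score:.2f}")  # prints 75.00
```

Under this reading, a score of 88.89 on the 10,000-question split would correspond to 8,889 correct answers.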