Awesome Evaluation
Evaluation is one of the most active areas in Awesome LLM Papers: this collection holds 7,855 papers, evaluated on datasets such as MMLU, GSM8K, and MMLU-Pro. A strong starting point is "Judging LLM-as-a-judge With MT-Bench And Chatbot Arena".
Datasets & benchmarks
Key papers
- Judging LLM-as-a-judge With MT-Bench And Chatbot Arena (2023) | Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. | 46.70
- LLaMA: Open And Efficient Foundation Language Models (2023) | Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. | 36.83
- Self-RAG: Learning To Retrieve, Generate, And Critique Through Self-reflection (2023) | Akari Asai, Zeqiu Wu, Yizhong Wang, et al. | 33.82
- HarmBench: A Standardized Evaluation Framework For Automated Red Teaming And Robust Refusal (2024) | Mantas Mazeika, Long Phan, Xuwang Yin, et al. | 33.01
- JudgeLM: Fine-tuned Large Language Models Are Scalable Judges (2023) | Lianghui Zhu, Xinggang Wang, Xinlong Wang | 31.75
- The FineWeb Datasets: Decanting The Web For The Finest Text Data At Scale (2024) | Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, et al. | 31.68
- LongBench: A Bilingual, Multitask Benchmark For Long Context Understanding (2023) | Yushi Bai, Xin Lv, Jiajie Zhang, et al. | 31.59
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025) | DeepSeek-AI et al. | 31.23
- Chatbot Arena: An Open Platform For Evaluating LLMs By Human Preference (2024) | Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, et al. | 31.16
- Executable Code Actions Elicit Better LLM Agents (2024) | Xingyao Wang, Yangyi Chen, Lifan Yuan, et al. | 31.13
- The Dawn Of LMMs: Preliminary Explorations With GPT-4V(ision) (2023) | Zhengyuan Yang, Linjie Li, Kevin Lin, et al. | 30.34
- Adaptive-RAG: Learning To Adapt Retrieval-augmented Large Language Models Through Question Complexity (2024) | Soyeong Jeong, Jinheon Baek, Sukmin Cho, et al. | 29.58
- Prometheus: Inducing Fine-grained Evaluation Capability In Language Models (2023) | Seungone Kim, Jamin Shin, Yejin Cho, et al. | 28.76
- BigCodeBench: Benchmarking Code Generation With Diverse Function Calls And Complex Instructions (2024) | Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, et al. | 28.58
- Trustworthy LLMs: A Survey And Guideline For Evaluating Large Language Models' Alignment (2023) | Yang Liu, Yuanshun Yao, Jean-Francois Ton, et al. | 27.94
- Agentless: Demystifying LLM-based Software Engineering Agents (2024) | Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, et al. | 27.93
- Chain-of-verification Reduces Hallucination In Large Language Models (2023) | Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, et al. | 27.71
- Contrastive Preference Optimization: Pushing The Boundaries Of LLM Performance In Machine Translation (2024) | Haoran Xu, Amr Sharaf, Yunmo Chen, et al. | 27.70
- MultiHop-RAG: Benchmarking Retrieval-augmented Generation For Multi-hop Queries (2024) | Yixuan Tang, Yi Yang | 27.60
- Aya Model: An Instruction Finetuned Open-access Multilingual Language Model (2024) | Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, et al. | 27.57
- LiveCodeBench: Holistic And Contamination Free Evaluation Of Large Language Models For Code (2024) | Naman Jain, King Han, Alex Gu, et al. | 27.45
- The Reversal Curse: LLMs Trained On "A Is B" Fail To Learn "B Is A" (2023) | Lukas Berglund, Meg Tong, Max Kaufmann, et al. | 27.32
- Llama Guard: LLM-based Input-output Safeguard For Human-AI Conversations (2023) | Hakan Inan, Kartikeya Upasani, Jianfeng Chi, et al. | 27.06
- GSM-Symbolic: Understanding The Limitations Of Mathematical Reasoning In Large Language Models (2024) | Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, et al. | 27.04
- InternLM2 Technical Report (2024) | Zheng Cai, Maosong Cao, Haojiong Chen, et al. | 27.03
- LongAlign: A Recipe For Long Context Alignment Of Large Language Models (2024) | Yushi Bai, Xin Lv, Jiajie Zhang, et al. | 26.93
- DoLa: Decoding By Contrasting Layers Improves Factuality In Large Language Models (2023) | Yung-Sung Chuang, Yujia Xie, Hongyin Luo, et al. | 26.86
- Evaluating Large Language Models: A Comprehensive Survey (2023) | Zishan Guo, Renren Jin, Chuang Liu, et al. | 26.82
- LMSYS-Chat-1M: A Large-scale Real-world LLM Conversation Dataset (2023) | Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. | 26.82
- DataDreamer: A Tool For Synthetic Data Generation And Reproducible LLM Workflows (2024) | Ajay Patel, Colin Raffel, Chris Callison-Burch | 26.77
- Replacing Judges With Juries: Evaluating LLM Generations With A Panel Of Diverse Models (2024) | Pat Verga, Sebastian Hofstatter, Sophia Althammer, et al. | 26.69
- Bias And Fairness In Large Language Models: A Survey (2023) | Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, et al. | 26.54
- PromptBench: A Unified Library For Evaluation Of Large Language Models (2023) | Kaijie Zhu, Qinlin Zhao, Hao Chen, et al. | 26.53
- SaulLM-7B: A Pioneering Large Language Model For Law (2024) | Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, et al. | 26.00
- MAmmoTH: Building Math Generalist Models Through Hybrid Instruction Tuning (2023) | Xiang Yue, Xingwei Qu, Ge Zhang, et al. | 25.94
- BABILong: Testing The Limits Of LLMs With Long Context Reasoning-in-a-haystack (2024) | Yuri Kuratov, Aydar Bulatov, Petr Anokhin, et al. | 25.74
- Fine-tuning Language Models For Factuality (2023) | Katherine Tian, Eric Mitchell, Huaxiu Yao, et al. | 25.62
- FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation (2023) | Tu Vu, Mohit Iyyer, Xuezhi Wang, et al. | 25.58
- LLMs Know More Than They Show: On The Intrinsic Representation Of LLM Hallucinations (2024) | Hadas Orgad, Michael Toker, Zorik Gekhman, et al. | 25.38
- Detoxifying Large Language Models Via Knowledge Editing (2024) | Mengru Wang, Ningyu Zhang, Ziwen Xu, et al. | 25.02
- Judging The Judges: Evaluating Alignment And Vulnerabilities In LLMs-as-judges (2024) | Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, et al. | 24.88
- RAGAS: Automated Evaluation Of Retrieval Augmented Generation (2023) | Shahul Es, Jithin James, Luis Espinosa-Anke, et al. | 24.85
- Siren's Song In The AI Ocean: A Survey On Hallucination In Large Language Models (2023) | Yue Zhang, Yafu Li, Leyang Cui, et al. | 24.78
- ReConcile: Round-table Conference Improves Reasoning Via Consensus Among Diverse LLMs (2023) | Justin Chih-Yao Chen, Swarnadeep Saha, Mohit Bansal | 24.61
- Detecting Pretraining Data From Large Language Models (2023) | Weijia Shi, Anirudh Ajith, Mengzhou Xia, et al. | 24.57
- SALAD-Bench: A Hierarchical And Comprehensive Safety Benchmark For Large Language Models (2024) | Lijun Li, Bowen Dong, Ruohui Wang, et al. | 24.45
- From Crowdsourced Data To High-quality Benchmarks: Arena-Hard And BenchBuilder Pipeline (2024) | Tianle Li, Wei-Lin Chiang, Evan Frick, et al. | 24.39
- WildBench: Benchmarking LLMs With Challenging Tasks From Real Users In The Wild (2024) | Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, et al. | 24.33
- InjecAgent: Benchmarking Indirect Prompt Injections In Tool-integrated Large Language Model Agents (2024) | Qiusi Zhan, Zhixiang Liang, Zifan Ying, et al. | 24.31
- Rephrase And Respond: Let Large Language Models Ask Better Questions For Themselves (2023) | Yihe Deng, Weitong Zhang, Zixiang Chen, et al. | 24.30
- A Wolf In Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily (2023) | Peng Ding, Jun Kuang, Dan Ma, et al. | 24.05
- Platypus: Quick, Cheap, And Powerful Refinement Of LLMs (2023) | Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz | 24.03
- MMAU: A Holistic Benchmark Of Agent Capabilities Across Diverse Domains (2024) | Guoli Yin, Haoping Bai, Shuang Ma, et al. | 23.95
- Patchscopes: A Unifying Framework For Inspecting Hidden Representations Of Language Models (2024) | Asma Ghandeharioun, Avi Caciularu, Adam Pearce, et al. | 23.87
- ChatGLM-Math: Improving Math Problem-solving In Large Language Models With A Self-critique Pipeline (2024) | Yifan Xu, Xiao Liu, Xinghan Liu, et al. | 23.84
- RouterBench: A Benchmark For Multi-LLM Routing System (2024) | Qitian Jason Hu, Jacob Bieker, Xiuyu Li, et al. | 23.73
- tinyBenchmarks: Evaluating LLMs With Fewer Examples (2024) | Felipe Maia Polo, Lucas Weber, Leshem Choshen, et al. | 23.68
- Meta-Rewarding Language Models: Self-improving Alignment With LLM-as-a-meta-judge (2024) | Tianhao Wu, Weizhe Yuan, Olga Golovneva, et al. | 23.63
- Same Task, More Tokens: The Impact Of Input Length On The Reasoning Performance Of Large Language Models (2024) | Mosh Levy, Alon Jacoby, Yoav Goldberg | 23.56
- Table-GPT: Table-tuned GPT For Diverse Table Tasks (2023) | Peng Li, Yeye He, Dror Yashar, et al. | 23.44