Awesome Code
Code is one of the most active areas in Awesome LLM Papers, with 1,487 papers in this collection, evaluated on datasets such as LiveCodeBench, MMLU, and AIME-24. A strong starting point is "Longbench: A Bilingual, Multitask Benchmark For Long Context Understanding".
Datasets & benchmarks
Key papers
- Longbench: A Bilingual, Multitask Benchmark For Long Context Understanding (2023) | Yushi Bai, Xin Lv, Jiajie Zhang, et al. | 31.59
- Executable Code Actions Elicit Better LLM Agents (2024) | Xingyao Wang, Yangyi Chen, Lifan Yuan, et al. | 31.13
- Autogen: Enabling Next-gen LLM Applications Via Multi-agent Conversation (2023) | Qingyun Wu, Gagan Bansal, Jieyu Zhang, et al. | 29.16
- Bigcodebench: Benchmarking Code Generation With Diverse Function Calls And Complex Instructions (2024) | Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, et al. | 28.58
- Agentless: Demystifying LLM-based Software Engineering Agents (2024) | Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, et al. | 27.93
- Lm-infinite: Zero-shot Extreme Length Generalization For Large Language Models (2023) | Chi Han, Qifan Wang, Hao Peng, et al. | 27.86
- Large Language Models For Software Engineering: A Systematic Literature Review (2023) | Xinyi Hou, Yanjie Zhao, Yue Liu, et al. | 27.84
- Contrastive Preference Optimization: Pushing The Boundaries Of LLM Performance In Machine Translation (2024) | Haoran Xu, Amr Sharaf, Yunmo Chen, et al. | 27.70
- Livecodebench: Holistic And Contamination Free Evaluation Of Large Language Models For Code (2024) | Naman Jain, King Han, Alex Gu, et al. | 27.45
- Magicoder: Empowering Code Generation With Oss-instruct (2023) | Yuxiang Wei, Zhe Wang, Jiawei Liu, et al. | 27.43
- Flashrag: A Modular Toolkit For Efficient Retrieval-augmented Generation Research (2024) | Jiajie Jin, Yutao Zhu, Guanting Dong, et al. | 26.95
- Datadreamer: A Tool For Synthetic Data Generation And Reproducible LLM Workflows (2024) | Ajay Patel, Colin Raffel, Chris Callison-Burch | 26.77
- A Survey On Large Language Models For Code Generation (2024) | Juyong Jiang, Fan Wang, Jiasi Shen, et al. | 26.64
- Eagle: Exploring The Design Space For Multimodal LLMs With Mixture Of Encoders (2024) | Min Shi, Fuxiao Liu, Shihao Wang, et al. | 25.55
- LLMs Know More Than They Show: On The Intrinsic Representation Of LLM Hallucinations (2024) | Hadas Orgad, Michael Toker, Zorik Gekhman, et al. | 25.38
- LLM360: Towards Fully Transparent Open-source LLMs (2023) | Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, et al. | 24.01
- SPHINX-X: Scaling Data And Parameters For A Family Of Multi-modal Large Language Models (2024) | Dongyang Liu, Renrui Zhang, Longtian Qiu, et al. | 23.22
- A Paradigm Shift In Machine Translation: Boosting Translation Performance Of Large Language Models (2023) | Haoran Xu, Young Jin Kim, Amr Sharaf, et al. | 23.20
- Ml-bench: Evaluating Large Language Models And Agents For Machine Learning Tasks On Repository-level Code (2023) | Xiangru Tang, Yuliang Liu, Zefan Cai, et al. | 23.09
- Stepcoder: Improve Code Generation With Reinforcement Learning From Compiler Feedback (2024) | Shihan Dou, Yan Liu, Haoxiang Jia, et al. | 22.83
- Ferret-v2: An Improved Baseline For Referring And Grounding With Large Language Models (2024) | Haotian Zhang, Haoxuan You, Philipp Dufter, et al. | 22.68
- AST-T5: Structure-aware Pretraining For Code Generation And Understanding (2024) | Linyuan Gong, Mostafa Elhoushi, Alvin Cheung | 21.95
- Llm4decompile: Decompiling Binary Code With Large Language Models (2024) | Hanzhuo Tan, Qi Luo, Jing Li, et al. | 21.86
- Safedecoding: Defending Against Jailbreak Attacks Via Safety-aware Decoding (2024) | Zhangchen Xu, Fengqing Jiang, Luyao Niu, et al. | 21.73
- Large Language Models For Compiler Optimization (2023) | Chris Cummins, Volker Seeker, Dejan Grubisic, et al. | 21.62
- Verilogeval: Evaluating Large Language Models For Verilog Code Generation (2023) | Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, et al. | 21.49
- If LLM Is The Wizard, Then Code Is The Wand: A Survey On How Code Empowers Large Language Models To Serve As Intelligent Agents (2024) | Ke Yang, Jiateng Liu, John Wu, et al. | 21.32
- RLEF: Grounding Code LLMs In Execution Feedback With Reinforcement Learning (2024) | Jonas Gehring, Kunhao Zheng, Jade Copet, et al. | 21.18
- Red-teaming Large Language Models Using Chain Of Utterances For Safety-alignment (2023) | Rishabh Bhardwaj, Soujanya Poria | 20.95
- How Abilities In Large Language Models Are Affected By Supervised Fine-tuning Data Composition (2023) | Guanting Dong, Hongyi Yuan, Keming Lu, et al. | 20.76
- A Simple And Effective \(L_2\) Norm-based Strategy For KV Cache Compression (2024) | Alessio Devoto, Yu Zhao, Simone Scardapane, et al. | 20.70
- CRAFT: Customizing LLMs By Creating And Retrieving From Specialized Toolsets (2023) | Lifan Yuan, Yangyi Chen, Xingyao Wang, et al. | 20.63
- From Code To Correctness: Closing The Last Mile Of Code Generation With Hierarchical Debugging (2024) | Yuling Shi, Songsong Wang, Chengcheng Wan, et al. | 20.33
- Exploring The Role Of Large Language Models In Prompt Encoding For Diffusion Models (2024) | Bingqi Ma, Zhuofan Zong, Guanglu Song, et al. | 19.55
- Frozen Transformers In Language Models Are Effective Visual Encoder Layers (2023) | Ziqi Pang, Ziyang Xie, Yunze Man, et al. | 19.32
- Who Validates The Validators? Aligning LLM-assisted Evaluation Of LLM Outputs With Human Preferences (2024) | Shreya Shankar, J. D. Zamfirescu-Pereira, Björn Hartmann, et al. | 19.31
- SARATHI: Efficient LLM Inference By Piggybacking Decodes With Chunked Prefills (2023) | Amey Agrawal, Ashish Panwar, Jayashree Mohan, et al. | 18.81
- Beyond Functional Correctness: Exploring Hallucinations In LLM-generated Code (2024) | Fang Liu, Yang Liu, Lin Shi, et al. | 18.73
- Liger Kernel: Efficient Triton Kernels For LLM Training (2024) | Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, et al. | 18.71
- Repairagent: An Autonomous, LLM-based Agent For Program Repair (2024) | Islem Bouzenia, Premkumar Devanbu, Michael Pradel | 18.60
- Goex: Perspectives And Designs Towards A Runtime For Autonomous LLM Applications (2024) | Shishir G. Patil, Tianjun Zhang, Vivian Fang, et al. | 18.57
- Mercury: A Code Efficiency Benchmark For Code Large Language Models (2024) | Mingzhe Du, Anh Tuan Luu, Bin Ji, et al. | 18.43
- Astraios: Parameter-efficient Instruction Tuning Code Large Language Models (2024) | Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, et al. | 18.02
- API-BLEND: A Comprehensive Corpora For Training And Benchmarking API LLMs (2024) | Kinjal Basu, Ibrahim Abdelaziz, Subhajit Chaudhury, et al. | 17.98
- Debug Like A Human: A Large Language Model Debugger Via Verifying Runtime Execution Step-by-step (2024) | Li Zhong, Zilong Wang, Jingbo Shang | 17.98
- Llama-reviewer: Advancing Code Review Automation With Large Language Models Through Parameter-efficient Fine-tuning (2023) | Junyi Lu, Lei Yu, Xiaojia Li, et al. | 17.90
- Openba: An Open-sourced 15B Bilingual Asymmetric Seq2seq Model Pre-trained From Scratch (2023) | Juntao Li, Zecheng Tang, Yuyang Ding, et al. | 17.68
- Squeezed Attention: Accelerating Long Context Length LLM Inference (2024) | Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, et al. | 17.60
- Ecoassistant: Using LLM Assistant More Affordably And Accurately (2023) | Jieyu Zhang, Ranjay Krishna, Ahmed H. Awadallah, et al. | 17.46
- At Which Training Stage Does Code Data Help LLMs Reasoning? (2023) | Yingwei Ma, Yue Liu, Yue Yu, et al. | 17.45
- Exploring Parameter-efficient Fine-tuning Techniques For Code Generation With Large Language Models (2023) | Martin Weyssow, Xin Zhou, Kisub Kim, et al. | 17.45
- UMBRELA: Umbrela Is The (open-source Reproduction Of The) Bing Relevance Assessor (2024) | Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, et al. | 17.41
- Chainforge: A Visual Toolkit For Prompt Engineering And LLM Hypothesis Testing (2023) | Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, et al. | 17.40
- Codescope: An Execution-based Multilingual Multitask Multidimensional Benchmark For Evaluating LLMs On Code Understanding And Generation (2023) | Weixiang Yan, Haitian Liu, Yunkun Wang, et al. | 17.37
- Debugbench: Evaluating Debugging Capability Of Large Language Models (2024) | Runchu Tian, Yining Ye, Yujia Qin, et al. | 17.13
- Codeeditorbench: Evaluating Code Editing Capability Of Large Language Models (2024) | Jiawei Guo, Ziming Li, Xueling Liu, et al. | 17.07
- Agent Skills For Large Language Models: Architecture, Acquisition, Security, And The Path Forward (2026) | Renjun Xu, Yang Yan | 17.04
- Causal Parrots: Large Language Models May Talk Causality But Are Not Causal (2023) | Matej Zečević, Moritz Willig, Devendra Singh Dhami, et al. | 16.91
- Can It Edit? Evaluating The Ability Of Large Language Models To Follow Code Editing Instructions (2023) | Federico Cassano, Luisa Li, Akul Sethi, et al. | 16.72
- Xgrammar: Flexible And Efficient Structured Generation Engine For Large Language Models (2024) | Yixin Dong, Charlie F. Ruan, Yaxing Cai, et al. | 16.58