Awesome Testing
Testing is one of the most active areas in Awesome AI for Code, with 1,477 papers in this collection, evaluated on datasets such as HumanEval, MBPP, and LiveCodeBench. A strong starting point is "How Can ChatGPT Support Human Security Testers to Help Mitigate Supply Chain Attacks?".
Datasets & benchmarks
Key papers
- How Can ChatGPT Support Human Security Testers to Help Mitigate Supply Chain Attacks? (2023) · Ying Zhang et al. · 8.58
- Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents (2026) · Xiang Liu et al. · 6.46
- Fuzzing BusyBox: Leveraging LLM and Crash Reuse for Embedded Bug Unearthing (2024) · Asmita et al. · 6.02
- VRank: Enhancing Verilog Code Generation from Large Language Models via Self-Consistency (2025) · Zhuorui Zhao et al. · 5.90
- Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization (2026) · Anmol Agarwal et al. · 5.81
- Fully Autonomous Programming using Iterative Multi-Agent Debugging with Large Language Models (2025) · Anastasiia Grishina, Vadim Liventsev, Aki Härmä, and Leon Moonen · 5.65
- TOGLL: Correct and Strong Test Oracle Generation with LLMs (2024) · Soneya Binta Hossain and Matthew Dwyer · 5.40
- Advancing Code Coverage: Incorporating Program Analysis with Large Language Models (2024) · Chen Yang et al. · 5.34
- KernelGPT: Enhanced Kernel Fuzzing via Large Language Models (2024) · Chenyuan Yang et al. · 5.18
- C2SaferRust: Transforming C Projects into Safer Rust with NeuroSymbolic Techniques (2025) · Vikram Nitin et al. · 5.18
- KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding (2025) · Zhangchen Xu et al. · 5.04
- CodeContests+: High-Quality Test Case Generation for Competitive Programming (2025) · Zihan Wang et al. · 5.04
- Validating Network Protocol Parsers with Traceable RFC Document Interpretation (2025) · Mingwei Zheng et al. · 4.93
- Combining Language and App UI Analysis for the Automated Assessment of Bug Reproduction Steps (2025) · Junayed Mahmud et al. · 4.82
- Improving Deep Assertion Generation via Fine-Tuning Retrieval-Augmented Pre-trained Language Models (2025) · Quanjun Zhang et al. · 4.82
- LLM4CVE: Enabling Iterative Automated Vulnerability Repair with Large Language Models (2025) · Mohamad Fakih et al. · 4.76
- Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation (2025) · Azat Abdullin et al. · 4.76
- ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework (2025) · Vali Tawosi et al. · 4.75
- GeoJSEval: An Automated Evaluation Framework for Large Language Models on JavaScript-Based Geospatial Computation and Visualization Code Generation (2025) · Guanyu Chen et al. · 4.58
- On the Challenges of Fuzzing Techniques via Large Language Models (2024) · Linghan Huang et al. · 4.57
- FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation (2025) · Wei Li et al. · 4.53
- Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation (2024) · Nathaniel Pinckney et al. · 4.48
- Cobblestone: A Divide-and-Conquer Approach for Automating Formal Verification (2024) · Saketh Ram Kasibatla et al. · 4.47
- Tratto: A Neuro-Symbolic Approach to Deriving Axiomatic Test Oracles (2025) · Davide Molinelli et al. · 4.42
- From Rocq to Metal: A Pipeline for Formally Verified Microcontroller Firmware (2026) · Valentin Bergeron et al. · 4.39
- SPOQ: Specialist Orchestrated Queuing for Multi-Agent Software Engineering (2026) · Royce Carbowitz et al. · 4.39
- Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation (2026) · Bagus Rakadyanto Oktavianto Putra et al. · 4.39
- Neural Change Prediction: Relating Software Changes to Their Effects and Vice Versa (2026) · Laura Plein et al. · 4.39
- FLARE: Fine-Grained Diagnostic Feedback for LLM Code Refinement (2026) · Yinsheng Yao et al. · 4.39
- ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer (2026) · Jintao Huang et al. · 4.39
- TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation (2026) · Eric Spencer et al. · 4.39
- EstRTL: Functional Estimation Guided RTL Code Generation (2026) · Qi Xiong et al. · 4.39
- Multi-task LLMs for Bug Classification: Efficient Inference with Auxiliary Decoding Heads (2026) · Nikolai Rozanov · 4.39
- When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs (2026) · Kaviru Hapuarachchi · 4.39
- Unlocking LLM Code Correction with Iterative Feedback Loops (2026) · Le Zhang et al. · 4.39
- UTFix: Change Aware Unit Test Repairing using LLM (2025) · Shanto Rahman et al. · 4.36
- Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions (2026) · Matthew Kutakh · 4.33
- Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks (2026) · Victor M. dos Santos et al. · 4.33
- Verilog-Evolve: Feedback-Driven and Skill-Evolving Verilog Generation (2026) · Zehua Pei et al. · 4.33
- SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks? (2026) · Hwiwon Lee et al. · 4.33
- LLM-based Mockless Unit Test Generation for Java (2026) · Qinghua Xu et al. · 4.33
- Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning (2026) · Yilong Li et al. · 4.33
- EviACT: An Evidence-to-Action Framework for Agentic Program Repair (2026) · Qianru Meng et al. · 4.33
- AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications (2026) · Yuchao Wu et al. · 4.33
- Agentic Separation Logic Specification Synthesis (2026) · Tarun Suresh et al. · 4.33
- Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation (2026) · Le Bronnec Florian et al. · 4.33
- Multi-Agent LLM-based Metamorphic Testing for REST APIs (2026) · Shehroz Khan et al. · 4.33
- SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers (2026) · Kaihua Qin et al. · 4.33
- Inferring Code Correctness from Specification (2026) · Tambon Florian et al. · 4.33
- CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification (2024) · Yuchen Tian et al. · 4.32
- AsserT5: Test Assertion Generation Using a Fine-Tuned Code Language Model (2025) · Severin Primbs et al. · 4.30
- LLM Agents Making Agent Tools (2025) · Georg Wölflein et al. · 4.30
- Requirements-Driven Automated Software Testing: A Systematic Review (2025) · Fanyu Wang et al. · 4.30
- CoSQA+: Pioneering the Multi-Choice Code Search Benchmark with Test-Driven Agents (2024) · Jing Gong et al. · 4.25
- Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models (2025) · Kunpeng Zhang et al. · 4.25
- Correctness Assessment of Code Generated by Large Language Models Using Internal Representations (2025) · Tuan-Dung Bui et al. · 4.25
- Faster Configuration Performance Bug Testing with Neural Dual-level Prioritization (2025) · Youpeng Ma et al. · 4.25
- o3-mini vs DeepSeek-R1: Which One is Safer? (2025) · Aitor Arrieta et al. · 4.25
- Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents (2026) · Shuvendu K. Lahiri · 4.09
- Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection (2025) · Damian Gnieciak et al. · 3.97