CRUXEval
Canonical
10 papers using it
3,193 HF downloads
21 HF likes
2024 first seen
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
Home Page • GitHub Repository • Leaderboard • Sample Explorer
CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation) is a benchmark of 800 Python functions and input-output pairs. The benchmark consists of two tasks, CRUXEval-I (i
Hugging Face • License: MIT
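To make the "Python function plus input-output pair" structure described above concrete, here is a minimal sketch of checking that a claimed pair is consistent with actually running the function. The function `f` and the helper `pair_holds` are hypothetical stand-ins written for illustration; they are not taken from the benchmark or its tooling.

```python
from typing import Any, Callable


def f(text: str) -> str:
    # Hypothetical stand-in for a short benchmark-style function
    # (an assumption for illustration, not an actual CRUXEval item).
    return text[::-1].upper()


def pair_holds(func: Callable[[Any], Any], given_input: Any, claimed_output: Any) -> bool:
    # Run the function on the input and compare with the claimed output.
    # A model's guess for either side of the pair can be checked this way.
    return func(given_input) == claimed_output


if __name__ == "__main__":
    print(pair_holds(f, "abc", "CBA"))
    print(pair_holds(f, "abc", "abc"))
```

Checking by execution rather than by string comparison means any input that produces the expected output counts as consistent with the pair, which matters when several inputs map to the same output.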
Papers using CRUXEval (10)
- SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications
- How Robustly do LLMs Understand Execution Semantics?
- Towards a Neural Debugger for Python
- STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning
- Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?
- What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces
- A Tool for In-depth Analysis of Code Execution Reasoning of Large Language Models
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
- Towards a Neural Debugger for Python