CRUXEval

Canonical

10papers using it

2024first seen

CRUXEval: Code Reasoning, Understanding, and Execution Evaluation 🏠 Home Page • 💻 GitHub Repository • 🏆 Leaderboard • 🔎 Sample Explorer CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation) is a benchmark of 800 Python functions and input-output pairs. The benchmark consists of two tasks, CRUXEval-I (i

🔎 Find this dataset

Papers using CRUXEval (10)

A Tool for In-depth Analysis of Code Execution Reasoning of Large Language Models2025 · 1 cites

SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications2024 · 1 cites

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning2026

How Robustly do LLMs Understand Execution Semantics?2026

Towards a Neural Debugger for Python2026

Generating Verifiable Chain of Thoughts from Exection-Traces2025

STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning2025

Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?2025

What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces2025

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution2024 · 5 cites