BenchTrace

Emerging

2 papers using it

First seen: 2026

BenchTrace is a benchmark whose snapshot-reflection dataset contains 1,821 annotated episodes across six tasks. It is used to evaluate the self-evolution ability of LLM agents in terms of reflection quality and failure-avoidance behavior.

Papers using BenchTrace (2)

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents (2026)

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents (2026)