
RUT-Bench

Emerging
3 papers using it
79 HF downloads
0 HF likes
2025 first seen

Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

This repository contains the RUT-Bench benchmark, which consists of 1638 test samples for evaluating LLM agents under realistic user interactions.

Paper: Beyond Ideal Instruction: A Comprehensive Framework for Evaluating L

Papers using RUT-Bench (3)
