← all datasets

M-3ToolEval

Emerging
2 papers using it
First seen: 2025

The M3ToolEval benchmark is a set of tasks for evaluating the reliability of tool-use agents in code generation. It focuses on whether agents adhere to inter-tool contracts and produce correct outputs without execution attempts.

Papers using M-3ToolEval (2)
