M3ToolEval
Emerging
2 papers using it
First seen: 2025
M3ToolEval is a benchmark dataset of tasks for evaluating the reliability of tool-use agents in code generation. It focuses on whether agents adhere to inter-tool contracts and produce correct outputs without execution attempts.