AIME-24
Emerging · 5 papers using it
First seen: 2024
AIME-24 is a benchmark used to evaluate Tool-Integrated Reasoning systems. Its tasks require strategic planning and self-correction through sequential tool invocation.
Papers using AIME-24 (5)
- DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning
- What If We Allocate Test-Time Compute Adaptively?
- Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
- First Return, Entropy-Eliciting Explore
- LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation