Benchmarks
50 papers tagged Benchmarks
Papers
- R-Zero: Self-Evolving Reasoning LLM from Zero Data (2025) Chengsong Huang et al. 19.83
- Agentic Reinforced Policy Optimization (2025) Guanting Dong et al. 19.61
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning (2025) Haozhan Li et al. 18.40
- SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning (2025) Zhongwei Wan et al. 16.17
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning (2025) Fanqi Wan et al. 16.01
- EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments (2026) Jundong Xu et al. 16.01
- SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning (2026) Seokju Cho et al. 15.70
- MiniMax Sparse Attention (2026) Xunhao Lai et al. 15.51
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent (2025) Hongli Yu et al. 15.38
- Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments (2026) Qiuyue Wang et al. 15.10
- APPO: Agentic Procedural Policy Optimization (2026) Xucong Wang et al. 14.60
- FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents (2026) Jia Deng et al. 14.20
- Kimi K2.5: Visual Agentic Intelligence (2026) Kimi Team: Tongtong Bai et al. 13.84
- Agent Explorative Policy Optimization for Multimodal Agentic Reasoning (2026) Minki Kang et al. 13.75
- Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution (2026) Xucong Wang et al. 13.67
- LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories (2026) Baochang Ren et al. 13.59
- Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks (2026) Mengyu Zheng et al. 13.33
- ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time? (2026) Woojung Song et al. 13.24
- Heterogeneous Agent Collaborative Reinforcement Learning (2026) Zhixia Zhang et al. 13.18
- ACC: Compiling Agent Trajectories for Long-Context Training (2026) Qisheng Su et al. 13.06
- MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research (2026) Dingbang Wu et al. 12.91
- AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints (2026) Jiayu Liu et al. 12.83
- AWorld: Orchestrating the Training Recipe for Agentic AI (2025) Chengyue Yu et al. 12.78
- SWE-Explore: Benchmarking How Coding Agents Explore Repositories (2026) Shaoqiu Zhang et al. 12.73
- DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025) DeepSeek-AI et al. 12.70
- K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts (2026) Nahyun Lee et al. 12.62
- SkillOpt: Executive Strategy for Self-Evolving Agent Skills (2026) Yifan Yang et al. 12.58
- LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning (2026) Yifan Dai et al. 12.42
- ResearchMath-14K: Scaling Research-Level Mathematics via Agents (2026) Guijin Son et al. 12.37
- Streaming Communication in Multi-Agent Reasoning (2026) Zhen Yang et al. 12.32
- Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios (2024) Shijue Huang, Wanjun Zhong, Jianqiao Lu, et al. 12.31
- LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards (2026) Nianyi Lin et al. 12.21
- Orchestra-o1: Omnimodal Agent Orchestration (2026) Fan Zhang et al. 12.21
- DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation (2026) Yibo Wang et al. 12.15
- Self-Improving Language Models with Bidirectional Evolutionary Search (2026) Guowei Xu et al. 12.10
- HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry (2026) Tingyang Chen et al. 11.97
- Agent models: Internalizing Chain-of-Action Generation into Reasoning models (2025) Yuxiang Zhang et al. 11.87
- CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? (2026) Haolin Chen et al. 11.86
- Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution (2025) Tianrui Qin et al. 11.85
- MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents (2025) Kunlun Zhu et al. 11.84
- SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research (2026) Pu Ning et al. 11.78
- QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks (2026) Jian Xie et al. 11.71
- World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning (2026) Yucheng Zhou et al. 11.70
- Latent Collaboration in Multi-Agent Systems (2025) Jiaru Zou et al. 11.66
- Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning (2025) Haoran Luo et al. 11.55
- Rethinking Memory as Continuously Evolving Connectivity (2026) Jizhan Fang et al. 11.49
- Personal AI Agent for Camera Roll VQA (2026) Thao Nguyen et al. 11.45
- Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism (2026) Haoxiang Zhang et al. 11.44
- $\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows (2026) Haoran Zhang et al. 11.23
- Cosmos 3: Omnimodal World Models for Physical AI (2026) NVIDIA et al. 11.10