Evaluation
50 papers tagged Evaluation
Papers
- R-Zero: Self-Evolving Reasoning LLM from Zero Data (2025). Chengsong Huang et al. 19.83
- Agentic Reinforced Policy Optimization (2025). Guanting Dong et al. 19.61
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning (2025). Huatong Song et al. 18.73
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning (2025). Haozhan Li et al. 18.40
- Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL (2025). Weizhen Li et al. 18.23
- InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners (2025). Yuhang Liu et al. 16.27
- SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning (2025). Zhongwei Wan et al. 16.17
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning (2025). Fanqi Wan et al. 16.01
- EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments (2026). Jundong Xu et al. 16.01
- SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning (2026). Seokju Cho et al. 15.70
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent (2025). Hongli Yu et al. 15.38
- AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security (2026). Dongrui Liu et al. 14.78
- APPO: Agentic Procedural Policy Optimization (2026). Xucong Wang et al. 14.60
- FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents (2026). Jia Deng et al. 14.20
- Kimi K2.5: Visual Agentic Intelligence (2026). Kimi Team: Tongtong Bai et al. 13.84
- Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution (2026). Xucong Wang et al. 13.67
- LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories (2026). Baochang Ren et al. 13.59
- AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration (2026). Jiaqi Liu et al. 13.52
- Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks (2026). Mengyu Zheng et al. 13.33
- ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time? (2026). Woojung Song et al. 13.24
- Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application (2026). Jiachun Li et al. 13.22
- Heterogeneous Agent Collaborative Reinforcement Learning (2026). Zhixia Zhang et al. 13.18
- MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research (2026). Dingbang Wu et al. 12.91
- AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints (2026). Jiayu Liu et al. 12.83
- AWorld: Orchestrating the Training Recipe for Agentic AI (2025). Chengyue Yu et al. 12.78
- SWE-Explore: Benchmarking How Coding Agents Explore Repositories (2026). Shaoqiu Zhang et al. 12.73
- DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025). DeepSeek-AI et al. 12.70
- COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation (2026). Tianyi Zhou et al. 12.70
- K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts (2026). Nahyun Lee et al. 12.62
- SkillOpt: Executive Strategy for Self-Evolving Agent Skills (2026). Yifan Yang et al. 12.58
- From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI (2026). Yongheng Zhang et al. 12.47
- LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning (2026). Yifan Dai et al. 12.42
- ResearchMath-14K: Scaling Research-Level Mathematics via Agents (2026). Guijin Son et al. 12.37
- Planning, Creation, Usage: Benchmarking LLMs For Comprehensive Tool Utilization In Real-world Complex Scenarios (2024). Shijue Huang, Wanjun Zhong, Jianqiao Lu, et al. 12.31
- LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards (2026). Nianyi Lin et al. 12.21
- DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation (2026). Yibo Wang et al. 12.15
- Self-Improving Language Models with Bidirectional Evolutionary Search (2026). Guowei Xu et al. 12.10
- GrepSeek: Training Search Agents for Direct Corpus Interaction (2026). Alireza Salemi et al. 12.06
- Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents (2026). Suji Kim et al. 12.03
- HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry (2026). Tingyang Chen et al. 11.97
- Agent models: Internalizing Chain-of-Action Generation into Reasoning models (2025). Yuxiang Zhang et al. 11.87
- Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution (2025). Tianrui Qin et al. 11.85
- MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents (2025). Kunlun Zhu et al. 11.84
- QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks (2026). Jian Xie et al. 11.71
- World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning (2026). Yucheng Zhou et al. 11.70
- Latent Collaboration in Multi-Agent Systems (2025). Jiaru Zou et al. 11.66
- It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs (2026). Sangwoo Park et al. 11.64
- Relay Hindsight Experience Replay: Self-guided Continual Reinforcement Learning For Sequential Object Manipulation Tasks With Sparse Rewards (2022). Yongle Luo, Yuxin Wang, Kun Dong, et al. 11.58
- Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning (2025). Haoran Luo et al. 11.55
- EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery (2026). Amy Xin et al. 11.55