Awesome Artificial Intelligence
Artificial Intelligence is one of the most active areas in Awesome AI Agents, with 60 papers in this collection evaluated on datasets such as LIBERO, AgentBench, and RoboTwin 2.0. A strong starting point is "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning".
Key papers
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025) | DeepSeek-AI et al. | 21.94
- Arcee's Mergekit: A Toolkit For Merging Large Language Models (2024) | Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, et al. | 18.99
- Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL (2025) | Weizhen Li et al. | 18.23
- SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning (2025) | Zhongwei Wan et al. | 16.17
- OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data (2026) | Jiwen Liu et al. | 14.82
- Toward Generalist Autonomous Research via Hypothesis-Tree Refinement (2026) | Jiajie Jin et al. | 14.49
- Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution (2026) | Liliana Hotsko et al. | 12.93
- FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention (2026) | Yan Wang et al. | 12.92
- SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks (2026) | Hongcheng Gao et al. | 12.37
- HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers (2026) | Guozhen Zhang et al. | 12.25
- Rethinking RAG in Long Videos: What to Retrieve and How to Use It? (2026) | Yuho Lee et al. | 11.90
- MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents (2025) | Kunlun Zhu et al. | 11.84
- EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents (2026) | Weixian Xu et al. | 11.34
- Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World (2026) | Yusong Lin et al. | 10.88
- DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off (2026) | Xiaofan Li et al. | 10.80
- MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery (2026) | Shangheng Du et al. | 10.72
- From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills (2026) | Zisu Huang et al. | 10.54
- SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills (2026) | Yingtie Lei et al. | 10.29
- One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA (2026) | Zhi Zheng et al. | 10.21
- QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents (2026) | Ye Yuan et al. | 10.15
- Doremi: Grounding Language Model By Detecting And Recovering From Plan-execution Misalignment (2023) | Yanjiang Guo, Yen-Jen Wang, Lihan Zha, et al. | 9.92
- Unsupervised Skill Discovery for Agentic Data Analysis (2026) | Zhisong Qiu et al. | 9.90
- HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs (2026) | Yansong Ning et al. | 9.84
- When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents (2026) | Dongsheng Zhu et al. | 9.73
- MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining (2025) | LLM-Core Xiaomi: Bingquan Xia et al. | 9.66
- Adaptive-solver Framework For Dynamic Strategy Selection In Large Language Model Reasoning (2023) | Jianpeng Zhou, Wanjun Zhong, Yanlin Wang, et al. | 9.50
- ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions (2026) | Chuanyang Jin et al. | 9.41
- ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism (2025) | Jia Liu et al. | 8.67
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training (2026) | Yuanda Xu et al. | 8.60
- Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders (2026) | Yi Jing et al. | 8.60
- Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring (2026) | Seongheon Park et al. | 8.51
- TuRTLe: A Unified Evaluation of LLMs for RTL Generation (2025) | Dario Garcia-Gasulla et al. | 8.39
- Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification (2025) | Xu Xu et al. | 8.37
- Language-conditioned Imitation Learning With Base Skill Priors Under Unstructured Data (2023) | Hongkuan Zhou, Zhenshan Bing, Xiangtong Yao, et al. | 8.35
- Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior (2026) | Rafal Kocielnik et al. | 8.24
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (2024) | Jian Hu et al. | 7.89
- Towards Autonomous Mechanistic Reasoning in Virtual Cells (2026) | Yunhui Jang et al. | 7.58
- optimize_anything: A Universal API for Optimizing any Text Parameter (2026) | Lakshya A Agrawal et al. | 7.30
- HiPlan: Hierarchical Planning for LLM-Based Agents with Adaptive Global-Local Guidance (2025) | Ziyue Li et al. | 7.17
- PATS: Process-Level Adaptive Thinking Mode Switching (2025) | Yi Wang et al. | 7.13
- Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs (2025) | Jaemin Kim et al. | 7.11
- MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control (2026) | Yuchi Wang et al. | 7.04
- PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning (2025) | Wenfeng Feng et al. | 6.97
- Enhancing Robustness of LLM-Driven Multi-Agent Systems through Randomized Smoothing (2025) | Jinwei Hu et al. | 6.91
- ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning (2025) | Hanyang Chen et al. | 6.67
- Future-KL Regularized GRPO: Process-Level Credit Assignment from $f$-Divergence Regularization (2026) | Jiarui Yao et al. | 6.67
- Multi-Agent Causal Discovery Using Large Language Models (2024) | Hao Duong Le et al. | 6.56
- ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces (2026) | Jinu Lee et al. | 6.50
- Multi-Turn Code Generation Through Single-Step Rewards (2025) | Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury | 6.43
- REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks (2025) | Longling Geng, Edward Y. Chang | 6.41
- AutoEP: LLMs-Driven Automation of Hyperparameter Evolution for Metaheuristic Algorithms (2025) | Zhenxing Xu et al. | 6.29
- Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets (2025) | Adam Younsi et al. | 6.28
- Understanding the Fundamental Design Decisions of Retrieval-Augmented Generation Systems (2024) | Shengming Zhao et al. | 6.19
- Rewarding The Scientific Process: Process-level Reward Modeling For Agentic Data Analysis (2026) | Zhisong Qiu, Shuofei Qiao, Kewei Xu, et al. | 5.90
- LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI (2026) | Lalit Yadav et al. | 5.88
- ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis (2026) | Guohong Liu et al. | 5.82
- Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning (2025) | Xin Qiu et al. | 5.71
- Graph of States: Solving Abductive Tasks with Large Language Models (2026) | Yu Luo et al. | 5.69
- Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration (2026) | Haq Nawaz Malik et al. | 5.49
- Are LLMs Socially Adaptive? Contrasting Belief Evolution in Large Language Models and Humans (2024) | Yu Lei et al. | 5.40