Awesome Reinforcement Learning
Reinforcement Learning is one of the most active areas in Awesome LLM Papers, with 1,286 papers in this collection, evaluated on datasets like AIME-24, AIME-25, and MATH. A strong starting point is "Agenttuning: Enabling Generalized Agent Abilities For LLMs".
Datasets & benchmarks
Key papers
- Agenttuning: Enabling Generalized Agent Abilities For LLMs (2023) | Aohan Zeng, Mingdao Liu, Rui Lu, et al. | 32.67
- Step-dpo: Step-wise Preference Optimization For Long-chain Reasoning Of LLMs (2024) | Xin Lai, Zhuotao Tian, Yukang Chen, et al. | 31.31
- Agentverse: Facilitating Multi-agent Collaboration And Exploring Emergent Behaviors (2023) | Weize Chen, Yusheng Su, Jingwei Zuo, et al. | 29.70
- Helpsteer2: Open-source Dataset For Training Top-performing Reward Models (2024) | Zhilin Wang, Yi Dong, Olivier Delalleau, et al. | 29.68
- Agentscope: A Flexible Yet Robust Multi-agent Platform (2024) | Dawei Gao, Zitao Li, Xuchen Pan, et al. | 29.54
- RLAIF Vs. RLHF: Scaling Reinforcement Learning From Human Feedback With AI Feedback (2023) | Harrison Lee, Samrat Phatale, Hassan Mansoor, et al. | 29.41
- Eureka: Human-level Reward Design Via Coding Large Language Models (2023) | Yecheng Jason Ma, William Liang, Guanzhi Wang, et al. | 27.82
- Is DPO Superior To PPO For LLM Alignment? A Comprehensive Study (2024) | Shusheng Xu, Wei Fu, Jiaxuan Gao, et al. | 27.53
- Agentgym: Evolving Large Language Model-based Agents Across Diverse Environments (2024) | Zhiheng Xi, Yiwen Ding, Wenxiang Chen, et al. | 27.20
- The Unlocking Spell On Base LLMs: Rethinking Alignment Via In-context Learning (2023) | Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, et al. | 26.03
- Direct Nash Optimization: Teaching Language Models To Self-improve With General Preferences (2024) | Corby Rosset, Ching-An Cheng, Arindam Mitra, et al. | 25.73
- Teaching Large Language Models To Reason With Reinforcement Learning (2024) | Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, et al. | 25.08
- Agentohana: Design Unified Data And Training Pipeline For Effective Agent Learning (2024) | Jianguo Zhang, Tian Lan, Rithesh Murthy, et al. | 24.90
- Self-play With Execution Feedback: Improving Instruction-following Capabilities Of Large Language Models (2024) | Guanting Dong, Keming Lu, Chengpeng Li, et al. | 24.59
- Personalized Soups: Personalized Large Language Model Alignment Via Post-hoc Parameter Merging (2023) | Joel Jang, Seungone Kim, Bill Yuchen Lin, et al. | 24.50
- Federatedscope-LLM: A Comprehensive Package For Fine-tuning Large Language Models In Federated Learning (2023) | Weirui Kuang, Bingchen Qian, Zitao Li, et al. | 24.26
- Nash Learning From Human Feedback (2023) | Rémi Munos, Michal Valko, Daniele Calandriello, et al. | 23.79
- Meta-rewarding Language Models: Self-improving Alignment With LLM-as-a-meta-judge (2024) | Tianhao Wu, Weizhe Yuan, Olga Golovneva, et al. | 23.63
- Self-exploring Language Models: Active Preference Elicitation For Online Alignment (2024) | Shenao Zhang, Donghan Yu, Hiteshi Sharma, et al. | 23.63
- Expel: LLM Agents Are Experiential Learners (2023) | Andrew Zhao, Daniel Huang, Quentin Xu, et al. | 23.10
- Stepcoder: Improve Code Generation With Reinforcement Learning From Compiler Feedback (2024) | Shihan Dou, Yan Liu, Haoxiang Jia, et al. | 22.83
- Reinforced Self-training (ReST) For Language Modeling (2023) | Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, et al. | 22.75
- LLM Evaluators Recognize And Favor Their Own Generations (2024) | Arjun Panickssery, Samuel R. Bowman, Shi Feng | 22.26
- Routing To The Expert: Efficient Reward-guided Ensemble Of Large Language Models (2023) | Keming Lu, Hongyi Yuan, Runji Lin, et al. | 21.60
- Smartplay: A Benchmark For LLMs As Intelligent Agents (2023) | Yue Wu, Xuan Tang, Tom M. Mitchell, et al. | 21.32
- Vineppo: Refining Credit Assignment In RL Training Of LLMs (2024) | Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, et al. | 21.27
- Understanding The Effects Of RLHF On LLM Generalisation And Diversity (2023) | Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, et al. | 21.24
- Alphazero-like Tree-search Can Guide Large Language Model Decoding And Training (2023) | Xidong Feng, Ziyu Wan, Muning Wen, et al. | 21.21
- RLEF: Grounding Code LLMs In Execution Feedback With Reinforcement Learning (2024) | Jonas Gehring, Kunhao Zheng, Jade Copet, et al. | 21.18
- Interpretable Preferences Via Multi-objective Reward Modeling And Mixture-of-experts (2024) | Haoxiang Wang, Wei Xiong, Tengyang Xie, et al. | 20.64
- Recursive Introspection: Teaching Language Model Agents How To Self-improve (2024) | Yuxiao Qu, Tianjun Zhang, Naman Garg, et al. | 19.75
- RAIN: Your Language Models Can Align Themselves Without Finetuning (2023) | Yuhui Li, Fangyun Wei, Jinjing Zhao, et al. | 19.73
- LASER: LLM Agent With State-space Exploration For Web Navigation (2023) | Kaixin Ma, Hongming Zhang, Hongwei Wang, et al. | 19.71
- Optimization-based Prompt Injection Attack To LLM-as-a-judge (2024) | Jiawen Shi, Zenghui Yuan, Yinuo Liu, et al. | 19.67
- Remax: A Simple, Effective, And Efficient Reinforcement Learning Method For Aligning Large Language Models (2023) | Ziniu Li, Tian Xu, Yushun Zhang, et al. | 19.63
- Efficient Exploration For LLMs (2024) | Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, et al. | 19.21
- Panacea: Pareto Alignment Via Preference Adaptation For LLMs (2024) | Yifan Zhong, Chengdong Ma, Xiaoyuan Zhang, et al. | 18.85
- Rewarding Progress: Scaling Automated Process Verifiers For LLM Reasoning (2024) | Amrith Setlur, Chirag Nagpal, Adam Fisch, et al. | 18.76
- Preference Fine-tuning Of LLMs Should Leverage Suboptimal, On-policy Data (2024) | Fahim Tajwar, Anikait Singh, Archit Sharma, et al. | 18.58
- Navgpt-2: Unleashing Navigational Reasoning Capability For Large Vision-language Models (2024) | Gengze Zhou, Yicong Hong, Zun Wang, et al. | 18.43
- Regularizing Hidden States Enables Learning Generalizable Reward Model For LLMs (2024) | Rui Yang, Ruomeng Ding, Yong Lin, et al. | 18.19
- Pku-saferlhf: Towards Multi-level Safety Alignment For LLMs With Human Preference (2024) | Jiaming Ji, Donghai Hong, Borong Zhang, et al. | 18.18
- Agentic Reward Modeling: Integrating Human Preferences With Verifiable Correctness Signals For Reliable Reward Systems (2025) | Hao Peng, Yunjia Qi, Xiaozhi Wang, et al. | 17.49
- Improving Large Language Models Via Fine-grained Reinforcement Learning With Minimum Editing Constraint (2024) | Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, et al. | 17.37
- Aligning Large Language Models With Human Preferences Through Representation Engineering (2023) | Wenhao Liu, Xiaohua Wang, Muling Wu, et al. | 16.98
- Defending Large Language Models Against Jailbreak Attacks Via Layer-specific Editing (2024) | Wei Zhao, Zhe Li, Yige Li, et al. | 16.94
- Reason For Future, Act For Now: A Principled Framework For Autonomous LLM Agents With Provable Sample Efficiency (2023) | Zhihan Liu, Hao Hu, Shenao Zhang, et al. | 16.93
- When Is Tree Search Useful For LLM Planning? It Depends On The Discriminator (2024) | Ziru Chen, Michael White, Raymond Mooney, et al. | 16.92
- Inferaligner: Inference-time Alignment For Harmlessness Through Cross-model Guidance (2024) | Pengyu Wang, Dong Zhang, Linyang Li, et al. | 16.83
- Iterative Nash Policy Optimization: Aligning LLMs With General Preferences Via No-regret Learning (2024) | Yuheng Zhang, Dian Yu, Baolin Peng, et al. | 16.80
- Steerlm: Attribute Conditioned SFT As An (user-steerable) Alternative To RLHF (2023) | Yi Dong, Zhilin Wang, Makesh Narsimhan Sreedhar, et al. | 16.77
- Formal-LLM: Integrating Formal Language And Natural Language For Controllable LLM-based Agents (2024) | Zelong Li, Wenyue Hua, Hao Wang, et al. | 16.74
- Archer: Training Language Model Agents Via Hierarchical Multi-turn RL (2024) | Yifei Zhou, Andrea Zanette, Jiayi Pan, et al. | 16.61
- Self-explore: Enhancing Mathematical Reasoning In Language Models With Fine-grained Rewards (2024) | Hyeonbin Hwang, Doyoung Kim, Seungone Kim, et al. | 16.59
- Let's Reward Step By Step: Step-level Reward Model As The Navigators For Reasoning (2023) | Qianli Ma, Haotian Zhou, Tingkai Liu, et al. | 16.52
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023) | Tianhao Wu, Banghua Zhu, Ruoyu Zhang, et al. | 16.51
- Hiagent: Hierarchical Working Memory Management For Solving Long-horizon Agent Tasks With Large Language Model (2024) | Mengkang Hu, Tianxing Chen, Qiguang Chen, et al. | 16.39
- Real: Efficient RLHF Training Of Large Language Models With Parameter Reallocation (2024) | Zhiyu Mei, Wei Fu, Kaiwei Li, et al. | 16.34
- Language Agents With Reinforcement Learning For Strategic Play In The Werewolf Game (2023) | Zelai Xu, Chao Yu, Fei Fang, et al. | 16.23
- Arithmetic Control Of LLMs For Diverse User Preferences: Directional Preference Alignment With Multi-objective Rewards (2024) | Haoxiang Wang, Yong Lin, Wei Xiong, et al. | 16.19