SAC-GLAM: Improving Online RL for LLM Agents with Soft Actor-Critic and Hindsight Relabeling
2024 · Loris Gaven, Clement Romac, Thomas Carta, et al.
Abstract
The past years have seen Large Language Models (LLMs) thrive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work has shown that online Reinforcement Learning (RL) can be used by LLM agents to discover and learn efficient strategies interactively. However, most prior work sticks to on-policy algorithms, which greatly narrows the set of methods such agents can use for both exploration and exploitation: experience replay and hindsight relabeling, for instance, require off-policy learning. Yet such methods may be key for LLM learning agents, in particular when designing autonomous, intrinsically motivated agents that sample and pursue their own goals (i.e., autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents. Our method not only paves the path towards autotelic LLM agents that learn online but can also outperform…
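The two off-policy ingredients named above, a replay buffer fed by hindsight relabeling and a Soft Actor-Critic critic target, can be sketched in a few lines. This is a minimal, generic sketch and not the authors' implementation. It assumes a finite set of candidate textual actions with one logit each (for example, an LLM's summed token log-probabilities for each candidate action string), a sparse reward of 1 when the achieved outcome string equals the pursued goal, and the "final" hindsight strategy (replay each episode as if its last outcome had been the goal). `Transition`, `ReplayBuffer`, `hindsight_relabel`, `soft_value`, and `soft_q_target` are hypothetical names.

```python
import math
import random
from collections import deque
from dataclasses import dataclass, replace
from typing import Deque, List, Sequence


@dataclass(frozen=True)
class Transition:
    """One step of a goal-conditioned textual episode (hypothetical schema)."""
    obs: str
    action: str
    next_obs: str
    goal: str       # goal the agent was pursuing
    achieved: str   # outcome reached after this step ("" if none)
    reward: float
    done: bool


def sparse_reward(achieved: str, goal: str) -> float:
    # Assumed reward: 1 when the achieved outcome matches the goal, else 0.
    return 1.0 if achieved == goal else 0.0


def hindsight_relabel(episode: Sequence[Transition]) -> List[Transition]:
    """'Final' strategy: replay the episode as if its last outcome had been the goal."""
    final_outcome = episode[-1].achieved
    if not final_outcome or final_outcome == episode[-1].goal:
        return []  # nothing was reached, or the original goal already succeeded
    # done flags are kept unchanged (a simplification)
    return [
        replace(t, goal=final_outcome, reward=sparse_reward(t.achieved, final_outcome))
        for t in episode
    ]


class ReplayBuffer:
    """FIFO buffer storing original and relabeled transitions for off-policy updates."""

    def __init__(self, capacity: int) -> None:
        self.data: Deque[Transition] = deque(maxlen=capacity)  # oldest items are evicted

    def add_episode(self, episode: Sequence[Transition]) -> None:
        self.data.extend(episode)
        self.data.extend(hindsight_relabel(episode))

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        return rng.sample(list(self.data), min(batch_size, len(self.data)))


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_value(q_values: Sequence[float], logits: Sequence[float], alpha: float) -> float:
    """Discrete-action soft value: V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s))."""
    probs = softmax(logits)
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(probs, q_values) if p > 0.0)


def soft_q_target(reward: float, done: bool, next_q: Sequence[float],
                  next_logits: Sequence[float], alpha: float, gamma: float) -> float:
    """Critic regression target: y = r + gamma * (1 - done) * V(s')."""
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * soft_value(next_q, next_logits, alpha)
```

Relabeling turns a failed episode (goal "take key", outcome "open door") into a successful one for the goal "open door", which is what makes a replay buffer valuable for sparse-reward, goal-conditioned agents. On the actor side, discrete-action SAC minimizes sum_a pi(a|s) (alpha log pi(a|s) - Q(s,a)), i.e. the negative of `soft_value` at the current state, so the entropy bonus weighted by `alpha` keeps the policy from collapsing onto a single candidate action too early.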
Related papers
- Sample-efficient Online Learning In LM Agents Via Hindsight Trajectory Rewriting (2025)
- Towards Agentic Self-learning LLMs In Search Environment (2025)
- Language Agents With Reinforcement Learning For Strategic Play In The Werewolf Game (2023)
- MARSHAL: Incentivizing Multi-agent Reasoning Via Self-play With Strategic LLMs (2025)
- Himac: Hierarchical Macro-micro Learning For Long-horizon LLM Agents (2026)
- From Laws To Motivation: Guiding Exploration Through Law-based Reasoning And Rewards (2024)
- True Knowledge Comes From Practice: Aligning LLMs With Embodied Environments Via Reinforcement Learning (2024)
- Enhancing Vision-language Model Training With Reinforcement Learning In Synthetic Worlds For Real-world Success (2025)