Free Energy-Driven Reinforcement Learning With Adaptive Advantage Shaping For Unsupervised Reasoning In LLMs
2026 · Yiming Huang, Zhenbo Shi, Xin-Cheng Wen, et al.
Abstract
arXiv:2605.04065v1
Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL methods often cannot adapt to the model's evolving reasoning capabilities during training, so they can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER), which adapts rewards to balance consensus and exploration based on the Free Energy Principle, and (2) Adaptive Advantage Shaping (AAS), which adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks show that FREIA outperforms other unsupervised RL-based baselines. Notably, in mathematical reasoning tasks, FREIA surpasses other
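The abstract does not give the formulas behind FER or AAS, so the following is only a toy sketch of the general idea, not the paper's method. It assumes a hypothetical reward that mixes an answer's vote share among sampled responses (consensus) with a small surprisal bonus (exploration), followed by a group-standardized advantage that is zeroed when the sampled rewards carry no spread. The names `toy_rewards`, `toy_advantages`, `beta`, and `eps` are illustrative assumptions.

```python
import math
from collections import Counter


def toy_rewards(answers, beta=0.1):
    """Hypothetical label-free reward (an assumption, not the paper's FER).

    Each sampled answer gets its vote share (consensus term) plus
    beta times its normalized surprisal (exploration term).
    """
    n = len(answers)
    counts = Counter(answers)
    # Normalize surprisal by its maximum possible value, log(n).
    max_surprisal = math.log(n) if n > 1 else 1.0
    rewards = []
    for a in answers:
        p = counts[a] / n
        surprisal = -math.log(p) / max_surprisal
        rewards.append(p + beta * surprisal)
    return rewards


def toy_advantages(rewards, eps=1e-6):
    """Hypothetical advantage shaping (an assumption, not the paper's AAS).

    Standardizes rewards within the sampled group; if the rewards have
    essentially no variance, returns zeros so no update signal is produced.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std < eps:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    sampled = ["4", "4", "4", "5"]  # final answers from four sampled responses
    rewards = toy_rewards(sampled)
    print(rewards)
    print(toy_advantages(rewards))
```

In this sketch the majority answer receives a positive advantage and the minority answer a negative one, while a group that agrees unanimously yields all-zero advantages. How FREIA actually balances the two terms and adapts to reward statistics is specified in the paper, not here.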