Direct Soft-policy Sampling Via Langevin Dynamics
2026 · Donghyeon Ki, Hee-Jun Ahn, Kyungyoon Kim, et al.
Abstract
Soft policies in reinforcement learning define policies as Boltzmann distributions over state-action value functions, providing a principled mechanism for balancing exploration and exploitation. However, realizing such soft policies in practice remains challenging. Existing approaches either depend on parametric policies with limited expressivity or employ diffusion-based policies whose intractable likelihoods hinder reliable entropy estimation in soft policy objectives. We address this challenge by directly realizing soft-policy sampling via Langevin dynamics driven by the action gradient of the Q-function. This perspective leads to Langevin Q-Learning (LQL), which samples actions from the target Boltzmann distribution without explicitly parameterizing the policy. However, directly applying Langevin dynamics suffers from slow mixing in high-dimensional and non-convex Q-landscapes, limiting its practical effectiveness. To overcome this, we propose Noise-Conditioned Langevin Q-Learning
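To make the sampling idea in the abstract concrete, here is a minimal sketch of unadjusted Langevin dynamics targeting a Boltzmann policy pi(a|s) proportional to exp(Q(s, a) / alpha). It is not the authors' LQL implementation: the function name `langevin_sample_action`, the temperature `alpha`, the step size, and the toy quadratic Q-function are all assumptions chosen for illustration. For this toy Q the target distribution is a Gaussian with mean `mu` and variance `alpha`, which makes the result easy to check.

```python
import numpy as np


def langevin_sample_action(grad_q, a0, alpha=1.0, step=1e-2, n_steps=500, rng=None):
    """Unadjusted Langevin dynamics for pi(a|s) proportional to exp(Q(s, a) / alpha).

    grad_q: callable returning dQ/da for a batch of actions (hypothetical interface).
    a0:     initial actions, shape (n_chains, action_dim).
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.array(a0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(a.shape)
        # Drift along the action gradient of Q (scaled by 1/alpha), plus Gaussian noise.
        a = a + step * grad_q(a) / alpha + np.sqrt(2.0 * step) * noise
    return a


# Toy example (an assumption, not from the paper): Q(a) = -0.5 * ||a - mu||^2,
# so the Boltzmann policy is Gaussian with mean mu and variance alpha per dimension.
mu = np.array([1.0, -2.0])
alpha = 0.5


def toy_grad_q(a):
    return -(a - mu)


rng = np.random.default_rng(0)
a0 = np.zeros((4000, 2))  # 4000 independent chains, 2-D actions
samples = langevin_sample_action(toy_grad_q, a0, alpha=alpha, rng=rng)

print("sample mean:", samples.mean(axis=0))
print("sample variance:", samples.var(axis=0))
```

In a real agent `grad_q` would come from differentiating a learned critic with respect to the action. The plain update above is the one the abstract describes as mixing slowly in high-dimensional, non-convex Q-landscapes, so it should be read as the baseline idea only.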
Related papers
- Sampling From Energy-based Policies Using Diffusion (2024)
- Langevin Soft Actor-critic: Efficient Exploration Through Uncertainty-driven Critic Learning (2025)
- Soft Policy Gradient Method For Maximum Entropy Deep Reinforcement Learning (2019)
- Diffusion Policy Through Conditional Proximal Policy Optimization (2026)
- Robust Reinforcement Learning Via Adversarial Training With Langevin Dynamics (2020)
- Diffusion Policies As An Expressive Policy Class For Offline Reinforcement Learning (2022)
- Equivalence Between Policy Gradients And Soft Q-learning (2017)
- Distributional Soft Actor-critic With Diffusion Policy (2025)