Awesome Papers

🆕 New

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Kaiyi Zhang et al.

Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its eff

💛 197

🆕 New

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

Guochao Jiang et al.

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimizatio

💛 123

🆕 New

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Zhepei Wei et al.

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry o

💛 45

🆕 New

Safe Continual Reinforcement Learning In Non-stationary Environments

Austin Coursey, Abel Diaz-Gonzalez, Marcos Quinones-Grueiro, et al.

Reinforcement learning (RL) offers a compelling data-driven paradigm for synthesizing controllers for complex systems when accurate physical models are unavailable; however, most e…

📚 20

🆕 New

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Bowen Wang et al.

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its extension to computer

💛 20

🆕 New

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Xinchen Zhang et al.

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation m

💛 9

🆕 New

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Hongru Hou et al.

Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL

💛 7

🆕 New

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

Hao Jiang et al.

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they of

💛 5

🆕 New

Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Zili Wang et al.

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token P

💛 4

🆕 New

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Utkarsh Tyagi et al.

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behavio

💛 4

🆕 New

AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models

Branislav Kveton et al.

We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes the reverse process, we optimiz

🆕 New

Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning

Damian Lebiedź et al.

This study aims to determine whether the application of Deep Reinforcement Learning (DRL) as a specialized execution overlay can enhance pair trading in highly volatile cr

🆕 New

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Xiaozhe Li et al.

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution

💛 1

🆕 New

Position: Deployed Reinforcement Learning should be Continual

Parnian Behdin et al.

Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained a

🆕 New

Large Language Models Hack Rewards, and Society

Wei Liu et al.

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations

🆕 New

Smart Transportation Without Neurons -- Fair Metro Network Expansion with Tabular Reinforcement Learning

Dimitris Michailidis et al.

We tackle the Metro Network Expansion Problem (MNEP), a subset of the Transport Network Design Problem (TNDP), which focuses on expanding metro systems to satisfy travel d

🆕 New

Exact Unlearning in Reinforcement Learning

Thanh Nguyen-Tang et al.

We formulate the problem of exact unlearning in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user's da

🆕 New

Dual Advantage Fields

Alexey Zemtsov et al.

Offline goal-conditioned reinforcement learning requires both long-horizon reachability estimates and local action comparisons. Dual goal representations provide value fie

🆕 New

From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments

Saket Tiwari et al.

We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, d

🆕 New

Policy Gradient for Continuous-Time Robust Markov Decision Processes

Tanya Veeravalli et al.

The framework of robust Markov decision processes (RMDPs) allows the design of reinforcement learning agents that satisfy performance guarantees under worst-case transitio

🆕 New

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots

Guangcheng Zhu et al.

Reinforcement learning with verifiable rewards (RLVR) has greatly advanced large reasoning models (LRMs), but it requires timely training on a huge fully-annotated dataset

🆕 New

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

Gyeongtae Yoo et al.

Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a

🆕 New

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Xuekang Wang et al.

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent bi

🆕 New

Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling

Marc Walden et al.

We investigate multi-agent deep reinforcement learning and propose two enhancements to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we int

📚 Awesome Reinforcement Learning