In-dataset Trajectory Return Regularization For Offline Preference-based Reinforcement Learning
2024 · Songjun Tu, Jingbo Sun, Qichao Zhang, et al.
Abstract
Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the learned reward via offline RL. However, accurately modeling step-wise rewards from trajectory-level preference feedback presents inherent challenges. The reward bias introduced, particularly the overestimation of predicted rewards, leads to optimistic trajectory stitching, which undermines the pessimism mechanism critical to the offline RL phase. To address this challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for offline PbRL, which leverages conditional sequence modeling to mitigate the risk of learning inaccurate trajectory stitching under reward bias. Specifically, DTR employs Decision Transformer and TD-Learning to strike a balance between maintaining fidelity to the behavior policy with high in-dataset trajectory returns a
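The first phase described in the abstract (learn a reward model from trajectory-level preferences, then annotate a reward-free dataset) can be illustrated with a minimal sketch. The abstract does not say which preference model is used, so the Bradley-Terry likelihood below is an assumption, as are the linear reward model, the synthetic data, and every name in the code (`bt_loss_and_grad`, `true_w`, `unlabeled_steps`). It is not the authors' implementation.

```python
import numpy as np

def bt_loss_and_grad(w, segs_a, segs_b, labels):
    """Bradley-Terry negative log-likelihood (assumed model) and its gradient.

    segs_a, segs_b: arrays of shape (n_pairs, horizon, dim) holding step features.
    labels: 1.0 where segment A is preferred, 0.0 where segment B is preferred.
    The reward model is linear (hypothetical): r(s, a) = w . phi(s, a).
    """
    # Difference of summed step features, so feat_diff @ w is the
    # difference of predicted segment returns.
    feat_diff = segs_a.sum(axis=1) - segs_b.sum(axis=1)   # (n_pairs, dim)
    diff = feat_diff @ w                                   # (n_pairs,)
    # log sigmoid(diff) and log sigmoid(-diff), computed stably.
    log_p_a = -np.logaddexp(0.0, -diff)
    log_p_b = -np.logaddexp(0.0, diff)
    loss = -np.mean(labels * log_p_a + (1.0 - labels) * log_p_b)
    # d loss / d diff = sigmoid(diff) - label
    p_a = np.exp(log_p_a)
    grad = ((p_a - labels)[:, None] * feat_diff).mean(axis=0)
    return loss, grad

# Synthetic preference data (illustrative only, not from the paper).
rng = np.random.default_rng(0)
n_pairs, horizon, dim = 64, 10, 4
true_w = rng.normal(size=dim)  # hidden reward standing in for the annotator
segs_a = rng.normal(size=(n_pairs, horizon, dim))
segs_b = rng.normal(size=(n_pairs, horizon, dim))
labels = ((segs_a.sum(axis=1) - segs_b.sum(axis=1)) @ true_w > 0).astype(float)

# Fit the reward model with plain gradient descent.
w = np.zeros(dim)
initial_loss, _ = bt_loss_and_grad(w, segs_a, segs_b, labels)
for _ in range(200):
    _, grad = bt_loss_and_grad(w, segs_a, segs_b, labels)
    w -= 0.05 * grad
final_loss, _ = bt_loss_and_grad(w, segs_a, segs_b, labels)

# Annotate a reward-free dataset with predicted step-wise rewards.
unlabeled_steps = rng.normal(size=(1000, dim))
annotated_rewards = unlabeled_steps @ w
```

Only segment-level comparisons constrain `w` here, so the per-step values in `annotated_rewards` are identified only indirectly. This is the kind of step-wise reward bias the abstract points to as the motivation for regularizing toward in-dataset trajectory returns.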
Related papers
- Dueling RL: Reinforcement Learning With Trajectory Preferences (2021)
- Harnessing Mixed Offline Reinforcement Learning Datasets Via Trajectory Weighting (2023)
- Provably Efficient Offline Reinforcement Learning With Trajectory-wise Reward (2022)
- Listwise Reward Estimation For Offline Preference-based Reinforcement Learning (2024)
- Bitrajdiff: Bidirectional Trajectory Generation With Diffusion Models For Offline Reinforcement Learning (2025)
- Offline Safe Reinforcement Learning Using Trajectory Classification (2024)
- Model-based Trajectory Stitching For Improved Offline Reinforcement Learning (2022)
- Policy Regularization With Dataset Constraint For Offline Reinforcement Learning (2023)