Offline Behavior Distillation
2024 · Shiye Lei, Sen Zhang, Dacheng Tao
Abstract
Massive reinforcement learning (RL) datasets are typically collected to train policies offline, without further environment interaction, but the large data volume can make training inefficient. To tackle this issue, we formulate offline behavior distillation (OBD), which synthesizes a small set of expert behavioral data from sub-optimal RL data, enabling rapid policy learning. We propose two naive OBD objectives, DBC and PBC, which measure distillation performance via the decision difference between policies trained on distilled data and either offline data or a near-expert policy. Because the bi-level optimization is intractable, the OBD objective is difficult to minimize to small values, which weakens PBC: its distillation performance guarantee has quadratic discount complexity \(\mathcal{O}(1/(1-\gamma)^2)\). We theoretically establish the equivalence between the policy performance and action-value weighted decision difference, and introduce action-value weighted PBC (Av-PBC) as a more effectiv
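To make the contrast between PBC and Av-PBC concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that the decision difference is a squared distance between the actions chosen by two policies, and that the action-value weight is the near-expert policy's value at each state. The names `pbc_loss`, `av_pbc_loss`, `student`, `expert`, and `q_value` are hypothetical and do not come from the paper. The sketch also omits the inner loop of the bi-level optimization, in which the student policy would be trained on the distilled data.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Policy = Callable[[Vector], Vector]
QFunction = Callable[[Vector, Vector], float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two action vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pbc_loss(states: Sequence[Vector], student: Policy, expert: Policy) -> float:
    """Unweighted decision difference, averaged over the given states (assumed form)."""
    diffs = [squared_distance(student(s), expert(s)) for s in states]
    return sum(diffs) / len(diffs)


def av_pbc_loss(
    states: Sequence[Vector],
    student: Policy,
    expert: Policy,
    q_value: QFunction,
) -> float:
    """Decision difference weighted by the expert's action value (assumed form)."""
    weighted = [
        q_value(s, expert(s)) * squared_distance(student(s), expert(s))
        for s in states
    ]
    return sum(weighted) / len(weighted)


# Toy data: two 1-D states, a student that is always off by 0.5,
# and a hypothetical action-value function that grows with the state.
states = [(0.0,), (1.0,)]
expert = lambda s: (s[0],)
student = lambda s: (s[0] + 0.5,)
q_value = lambda s, a: 1.0 + s[0]

pbc = pbc_loss(states, student, expert)
av_pbc = av_pbc_loss(states, student, expert, q_value)
print(pbc, av_pbc)
```

In this toy setup both states have the same decision difference, so the unweighted loss treats them equally, while the weighted loss counts the mistake at the higher-value state more heavily.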
Related papers
- Improving TD3-BC: Relaxed Policy Constraint For Offline Learning And Stable Online Fine-tuning (2022) 0.00
- Online Policy Distillation With Decision-attention (2024) 0.00
- Offline Behavioral Data Selection (2025) 0.00
- Offline Retraining For Online RL: Decoupled Policy Learning To Mitigate Exploration Bias (2023) 2.56
- Reliable Conditioning Of Behavioral Cloning For Offline Reinforcement Learning (2022) 0.00
- Constrained Policy Optimization With Explicit Behavior Density For Offline Reinforcement Learning (2023) 0.00
- Diffusion Policies With Value-conditional Optimization For Offline Reinforcement Learning (2025) 0.00
- B3C: A Minimalist Approach To Offline Multi-agent Reinforcement Learning (2025) 0.00