Pref-GUIDE: Continual Policy Learning From Real-time Human Feedback Via Preference-based Learning
2025 · Zhengran Ji, Boyuan Chen
Abstract
Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating rew
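The abstract's idea of comparing behaviors within short windows and filtering ambiguous feedback can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `window_preferences`, the non-overlapping window scheme, the `window` length, and the `margin` threshold are all assumptions made for illustration, and dropping ambiguous pairs is only one possible reading of "filtering".

```python
from itertools import combinations


def window_preferences(feedback, window=4, margin=0.1):
    """Convert a stream of scalar feedback into pairwise preference labels.

    feedback: list of floats, one scalar human rating per time step.
    window:   hypothetical window length; only steps that fall in the
              same window are compared with each other.
    margin:   hypothetical ambiguity threshold; pairs whose ratings
              differ by less than this are discarded.

    Returns a list of (i, j, label) tuples, where label is 1.0 if
    step i is preferred over step j and 0.0 otherwise.
    """
    pairs = []
    # Walk over non-overlapping windows (an assumption of this sketch).
    for start in range(0, len(feedback), window):
        indices = range(start, min(start + window, len(feedback)))
        for i, j in combinations(indices, 2):
            diff = feedback[i] - feedback[j]
            if abs(diff) < margin:
                continue  # too close to call: treat as ambiguous and skip
            pairs.append((i, j, 1.0 if diff > 0 else 0.0))
    return pairs


if __name__ == "__main__":
    ratings = [0.9, 0.2, 0.25, -0.5, 0.1]
    for i, j, label in window_preferences(ratings):
        print(f"step {i} vs step {j}: label={label}")
```

Restricting comparisons to a short window means only the relative ordering of nearby ratings matters, so slow drift in how a person uses the rating scale has less effect on the resulting labels. The labeled pairs could then feed a standard preference-based reward model.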
Related papers
- Directed Policy Gradient For Safe Reinforcement Learning With Human Advice (2018)
- Symbol Guided Hindsight Priors For Reward Learning From Human Preferences (2022)
- Reinforcement Learning From Diverse Human Preferences (2023)
- Tell Me Why: Training Preferences-based RL With Human Preferences And Step-level Explanations (2024)
- Co-evolution Of Policy And Internal Reward For Language Agents (2026)
- Optimizing Latent Goal By Learning From Trajectory Preference (2024)
- Influencing Reinforcement Learning Through Natural Language Guidance (2021)
- Dynamic Policy Fusion For User Alignment Without Re-interaction (2024)