Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards
2024 · Zhaohui Jiang, Xuening Feng, Paul Weng, et al.
Abstract
In practice, reinforcement learning (RL) agents are often trained with a possibly imperfect proxy reward function, which may lead to a human-agent alignment issue (i.e., the learned policy either converges to non-optimal performance with low cumulative rewards, or achieves high cumulative rewards but in an undesired manner). To tackle this issue, we consider a framework where a human labeler can provide additional feedback in the form of corrective actions, which express the labeler's action preferences, although this feedback may be imperfect as well. In this setting, to obtain a better-aligned policy guided by both learning signals, we propose a novel value-based deep RL algorithm called Iterative learning from Corrective actions and Proxy rewards (ICoPro), which cycles through three phases: (1) Solicit sparse corrective actions from a human labeler on the agent's demonstrated trajectories; (2) Incorporate these corrective actions into the Q-function using a margin loss to enf
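The abstract names a margin loss on the Q-function, but the excerpt is cut off before giving its form. As a minimal sketch only, assuming the large-margin form commonly used when learning Q-functions from demonstrations (not necessarily the exact ICoPro loss), the idea can be illustrated for a single state as follows. The function name `margin_loss` and the `margin` value are hypothetical.

```python
def margin_loss(q_values, corrective_action, margin=0.5):
    """Large-margin loss for one state (illustrative assumption, not the paper's exact loss).

    q_values: list of Q(s, a) for every action a in the state.
    corrective_action: index of the action the labeler prefers.
    margin: hypothetical hyperparameter; how far the preferred action's
        Q-value should exceed every other action's Q-value.
    """
    # Add the margin to every action except the corrective one.
    augmented = [
        q if a == corrective_action else q + margin
        for a, q in enumerate(q_values)
    ]
    # The loss is zero once the corrective action beats all others by the margin.
    return max(augmented) - q_values[corrective_action]


q = [1.0, 2.0, 0.0]
loss_if_greedy_matches = margin_loss(q, corrective_action=1)
loss_if_greedy_differs = margin_loss(q, corrective_action=0)
```

In this toy state the loss vanishes when the labeler's action already leads by at least the margin, and it is positive otherwise, which pushes the Q-function to rank the corrective action first. How such a term is weighted against the proxy-reward objective is not stated in the excerpt.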
Related papers
- Policy Improvement Reinforcement Learning (2026) 0.00
- Provably Feedback-efficient Reinforcement Learning Via Active Reward Learning (2023) 0.00
- When Errors Can Be Beneficial: A Categorization Of Imperfect Rewards For Policy Gradient (2026) 0.00
- Reinforcement Learning From Diverse Human Preferences (2023) 0.00
- REBEL: Reward Regularization-based Approach For Robotic Reinforcement Learning From Human Feedback (2023) 0.00
- Aligning Humans And Robots Via Reinforcement Learning From Implicit Human Feedback (2025) 2.26
- Self Punishment And Reward Backfill For Deep Q-learning (2020) 7.16
- Cooperative Inverse Reinforcement Learning (2016) 0.00