Training Value-aligned Reinforcement Learning Agents Using A Normative Prior
2021 · Md Sultan Al Nahian, Spencer Frazier, Brent Harrison, et al.
Abstract
As more machine learning agents interact with humans, it is increasingly a prospect that an agent trained to perform a task optimally, using only a measure of task performance as feedback, can violate societal norms for acceptable behavior or cause harm. Value alignment is a property of intelligent agents wherein they solely pursue non-harmful behaviors or human-beneficial goals. We introduce an approach to value-aligned reinforcement learning, in which we train an agent with two reward signals: a standard task performance reward, plus a normative behavior reward. The normative behavior reward is derived from a value-aligned prior model previously shown to classify text as normative or non-normative. We show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as being more normative. We test our value-alignment technique on three interactive text-based worlds; each world is designed specifically
Related papers
- Vision-based Generic Potential Function For Policy Alignment In Multi-agent Reinforcement Learning (2025)
- Scalable Agent Alignment Via Reward Modeling: A Research Direction (2018)
- Aligning Agents Via Planning: A Benchmark For Trajectory-level Reward Modeling (2026)
- Interpretable Multi-objective Reinforcement Learning Through Policy Orchestration (2018)
- Strategy Masking: A Method For Guardrails In Value-based Reinforcement Learning Agents (2025)
- Toward Virtuous Reinforcement Learning: A Critique And Roadmap (2025)
- Symbol Guided Hindsight Priors For Reward Learning From Human Preferences (2022)
- Value Augmented Sampling For Language Model Alignment And Personalization (2024)