Scalable Agent Alignment Via Reward Modeling: A Research Direction
2018 · Jan Leike, David Krueger, Tom Everitt, et al.
Abstract
One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning. We discuss the key challenges we expect to face when scaling reward modeling to complex and general domains, concrete approaches to mitigate these challenges, and ways to establish trust in the resulting agents.
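As a rough illustration of the two stages named in the abstract (learning a reward function from user feedback, then optimizing it), the toy sketch below may help. It is not the authors' method: the pairwise-comparison feedback, the Bradley-Terry likelihood, the three-action bandit setting, the simulated user, and every function name are assumptions chosen for brevity, and the reinforcement learning step is reduced to picking the action with the highest learned reward.

```python
import itertools
import math


def simulated_user(a, b, hidden_utility):
    """Hypothetical stand-in for a human: returns (preferred, other)."""
    return (a, b) if hidden_utility[a] >= hidden_utility[b] else (b, a)


def collect_preferences(hidden_utility, rounds):
    """Ask the user to compare every pair of actions, `rounds` times."""
    prefs = []
    for _ in range(rounds):
        for a, b in itertools.combinations(range(len(hidden_utility)), 2):
            prefs.append(simulated_user(a, b, hidden_utility))
    return prefs


def fit_reward_model(preferences, n_actions, lr=0.5, epochs=200):
    """Fit one scalar reward per action by gradient ascent on the
    Bradley-Terry log-likelihood of the observed (winner, loser) pairs."""
    r = [0.0] * n_actions
    for _ in range(epochs):
        grad = [0.0] * n_actions
        for winner, loser in preferences:
            # Model probability that `winner` is preferred over `loser`.
            p = 1.0 / (1.0 + math.exp(r[loser] - r[winner]))
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for a in range(n_actions):
            r[a] += lr * grad[a] / len(preferences)
    return r


def greedy_policy(reward):
    """Degenerate 'RL' step for a bandit: act on the learned reward."""
    return max(range(len(reward)), key=lambda a: reward[a])


# The agent never reads hidden_utility directly; it only sees comparisons.
hidden_utility = [0.1, 0.9, 0.4]  # assumed, for the simulated user only
prefs = collect_preferences(hidden_utility, rounds=5)
learned = fit_reward_model(prefs, n_actions=len(hidden_utility))
best_action = greedy_policy(learned)
```

Here the greedy policy selects action 1, the one the simulated user ranks highest, even though the learned reward values need not match the hidden utilities numerically. In a sequential environment the final step would be a full reinforcement learning algorithm trained against the learned reward rather than a single argmax.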
Related papers
- Aligning Agents Via Planning: A Benchmark For Trajectory-level Reward Modeling (2026)
- Reward Design For Reinforcement Learning Agents (2025)
- Reward Models In Deep Reinforcement Learning: A Survey (2025)
- An Agent Design With Goal Reaching Guarantees For Enhancement Of Learning (2024)
- ELIGN: Expectation Alignment As A Multi-agent Intrinsic Reward (2022)
- REBEL: Reward Regularization-based Approach For Robotic Reinforcement Learning From Human Feedback (2023)
- Coordinated Exploration Via Intrinsic Rewards For Multi-agent Reinforcement Learning (2019)
- Tiered Reward: Designing Rewards For Specification And Fast Learning Of Desired Behavior (2022)