The Alignment Ceiling: Objective Mismatch In Reinforcement Learning From Human Feedback
2023 Β· Nathan Lambert, Roberto Calandra
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. RLHF proceeds as collecting human preference data, training a reward model on said data, and optimizing a base ML model with respect to said reward for extrinsic evaluation metrics (e.g. MMLU, GSM8k). RLHF relies on many assumptions about how the various pieces fit together, such as a reward model capturing human preferences and an RL optimizer extracting the right signal from a reward model. As the RLHF process involves many distinct design decisions, it is easy to assume that multiple processes are correlated and therefore numerically linked. This apparent correlation is often not true, where reward models are easily overoptimized or RL optimizers can reduce performance on tasks not modeled in the data. Notable manifestations of models trained with imperfect RLHF systems are those that are prone to refusing basic requests for saf
Authors
(none)
Tags
Stats
Related papers
- A Survey Of Reinforcement Learning From Human Feedback (2023)0.00
- Principled Reinforcement Learning With Human Feedback From Pairwise Or \(k\)-wise Comparisons (2023)0.00
- Mapping Out The Space Of Human Feedback For Reinforcement Learning: A Conceptual Framework (2024)0.00
- A Survey On Enhancing Reinforcement Learning In Complex Environments: Insights From Human And LLM Feedback (2024)0.00
- Reinforcement Learning In The Era Of Llms: What Is Essential? What Is Needed? An RL Perspective On RLHF, Prompting, And Beyond (2023)0.00
- When Your Ais Deceive You: Challenges Of Partial Observability In Reinforcement Learning From Human Feedback (2024)0.00
- Aligning Humans And Robots Via Reinforcement Learning From Implicit Human Feedback (2025)2.26
- Explaining And Preventing Alignment Collapse In Iterative RLHF (2026)0.00