Efficiently Aligning Language Models With Online Natural Language Feedback
2026 · Christine Ye, Joe Benton
Abstract
arXiv:2605.04356v1. Reinforcement learning with verifiable rewards has been used to elicit impressive performance from language models in many domains. However, broadly beneficial deployments of AI may require us to train models with strong capabilities in "fuzzy", hard-to-supervise domains. In this paper, we develop methods to align language models in fuzzy domains where human experts can still provide a high-quality supervision signal, but only for a small number of model outputs, using online natural language feedback. Specifically, we train models by iteratively optimizing against proxy reward signals, stopping at the point of over-optimization, collecting fresh expert supervision, and updating the proxy reward. We construct proxy reward models from language models using in-context learning (ICL) and fine-tuning. We test our methods by eliciting creative writing and alignment research capabilities in Qwen3-8B and Haiku 4.5, respectively. For Qwen3-8B,
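The loop described in the abstract (optimize against a proxy reward, stop before over-optimizing, collect fresh expert labels, update the proxy) can be illustrated with a toy sketch. Nothing here is taken from the paper: the "policy" is a single number, `expert_score` is a hypothetical stand-in for expert feedback, the proxy is a least-squares slope refit on each round's fresh labels only, and a fixed `trust_radius` is an assumed substitute for whatever over-optimization criterion the authors actually use.

```python
import random


def expert_score(x: float) -> float:
    """Hypothetical stand-in for scarce expert feedback; the best output is x = 3."""
    return -(x - 3.0) ** 2


def fit_proxy_slope(labelled):
    """Least-squares slope of score vs. output: a deliberately crude proxy reward."""
    n = len(labelled)
    mean_x = sum(x for x, _ in labelled) / n
    mean_y = sum(y for _, y in labelled) / n
    var = sum((x - mean_x) ** 2 for x, _ in labelled)
    if var == 0.0:
        return 0.0
    return sum((x - mean_x) * (y - mean_y) for x, y in labelled) / var


def train(rounds=6, labels_per_round=4, trust_radius=1.0, lr=0.05, max_steps=20, seed=0):
    rng = random.Random(seed)
    policy = 0.0  # the toy "model" is one scalar
    history = []
    for _ in range(rounds):
        # Collect fresh expert labels on a few outputs near the current policy.
        outputs = [policy + rng.uniform(-0.5, 0.5) for _ in range(labels_per_round)]
        labelled = [(x, expert_score(x)) for x in outputs]
        # Update the proxy reward from the fresh labels only (an assumption).
        slope = fit_proxy_slope(labelled)
        # Optimize against the proxy, stopping before the policy leaves the
        # region the labels cover (assumed proxy for "over-optimization").
        start = policy
        for _ in range(max_steps):
            candidate = policy + lr * slope
            if abs(candidate - start) > trust_radius:
                break
            policy = candidate
        # Logged for inspection only; training never reads this value.
        history.append((policy, expert_score(policy)))
    return policy, history


if __name__ == "__main__":
    final_policy, history = train()
    for round_index, (p, score) in enumerate(history, start=1):
        print(f"round {round_index}: policy={p:.3f} expert_score={score:.3f}")
```

The linear proxy is only trustworthy near the labelled outputs, which is why each round stops at the trust radius and asks the expert again rather than following the proxy indefinitely.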
Related papers
- The Alignment Ceiling: Objective Mismatch In Reinforcement Learning From Human Feedback (2023)
- ReMax: A Simple, Effective, And Efficient Reinforcement Learning Method For Aligning Large Language Models (2023)
- SCRIBE: Structured Mid-level Supervision For Tool-using Language Models (2026)
- Value Augmented Sampling For Language Model Alignment And Personalization (2024)
- Data-dependent Exploration For Online Reinforcement Learning From Human Feedback (2026)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- Policy Improvement Reinforcement Learning (2026)
- Co-evolution Of Policy And Internal Reward For Language Agents (2026)