Value Augmented Sampling For Language Model Alignment And Personalization
2024 · Seungwook Han, Idan Shenfeld, Akash Srivastava, et al.
Abstract
Aligning Large Language Models (LLMs) to cater to different human preferences, learning new skills, and unlearning harmful behavior is an important problem. Search-based methods, such as Best-of-N or Monte-Carlo Tree Search, are performant, but impractical for LLM adaptation due to their high inference cost. On the other hand, using Reinforcement Learning (RL) for adaptation is computationally efficient, but performs worse due to the optimization challenges in co-training the value function and the policy. We present a new framework for reward optimization, Value Augmented Sampling (VAS), that can maximize different reward functions using data sampled from only the initial, frozen LLM. VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function, making the optimization stable, outperforming established baselines, such as PPO and DPO, on standard benchmarks, and achieving comparable results to Best-of-128 with lower inference cost. Unlike ex
Related papers
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- ReMax: A Simple, Effective, And Efficient Reinforcement Learning Method For Aligning Large Language Models (2023)
- Enhancing Vision-language Model Training With Reinforcement Learning In Synthetic Worlds For Real-world Success (2025)
- Adapt To Thrive! Adaptive Power-mean Policy Optimization For Improved LLM Reasoning (2026)
- Gradalign: Gradient-aligned Data Selection For LLM Reinforcement Learning (2026)
- DVPO: Distributional Value Modeling-based Policy Optimization For LLM Post-training (2026)
- End-to-end Optimization Of LLM-driven Multi-agent Search Systems Via Heterogeneous-group-based Reinforcement Learning (2025)
- Discovering Multiagent Learning Algorithms With Large Language Models (2026)