ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
2023 · Ziniu Li, Tian Xu, Yushun Zhang, et al.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is key to aligning Large Language Models (LLMs), typically paired with the Proximal Policy Optimization (PPO) algorithm. While PPO is a powerful method designed for general reinforcement learning tasks, it is overly sophisticated for LLMs, leading to laborious hyper-parameter tuning and significant computation burdens. To make RLHF efficient, we present ReMax, which leverages three properties of RLHF: fast simulation, deterministic transitions, and trajectory-level rewards. These properties are not exploited in PPO, making it less suitable for RLHF. Building on the renowned REINFORCE algorithm, ReMax does not require training an additional value model as in PPO and is further enhanced with a new variance reduction technique. ReMax offers several benefits over PPO: it is simpler to implement, eliminates more than 4 hyper-parameters in PPO, reduces GPU memory usage, and shortens training time. ReMax can save about 46% of GPU memory compared to PPO when
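As an illustration of the idea in the abstract (REINFORCE with a trajectory-level reward, no value model, and a variance-reducing baseline), here is a minimal Python sketch. The abstract as reproduced here does not spell out the variance reduction technique, so the sketch assumes the baseline is the reward of the greedily decoded response. The toy policy (one independent logit vector per step), `toy_reward`, and all function names are hypothetical and not taken from the authors' code.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_response(logits_per_step, rng):
    # Sample one token index per step from the toy policy.
    return [rng.choices(range(len(l)), weights=softmax(l))[0]
            for l in logits_per_step]

def greedy_response(logits_per_step):
    # Pick the highest-logit token at every step (first index on ties).
    return [max(range(len(l)), key=l.__getitem__) for l in logits_per_step]

def grad_log_prob(logits_per_step, tokens):
    # Gradient of sum_t log pi(token_t) w.r.t. the logits:
    # one-hot(token) minus softmax(logits) at each step.
    grads = []
    for logits, tok in zip(logits_per_step, tokens):
        probs = softmax(logits)
        grads.append([(1.0 if i == tok else 0.0) - p
                      for i, p in enumerate(probs)])
    return grads

def remax_style_gradient(logits_per_step, reward_fn, rng):
    # Assumption: the baseline is the reward of the greedy response.
    sampled = sample_response(logits_per_step, rng)
    baseline = reward_fn(greedy_response(logits_per_step))
    advantage = reward_fn(sampled) - baseline  # one scalar per trajectory
    return [[advantage * g for g in step]
            for step in grad_log_prob(logits_per_step, sampled)]

def toy_reward(tokens):
    # Hypothetical trajectory-level reward: count occurrences of token 2.
    return float(sum(1 for t in tokens if t == 2))

# Usage: a few gradient-ascent steps on the toy policy.
rng = random.Random(0)
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
for _ in range(50):
    grad = remax_style_gradient(logits, toy_reward, rng)
    logits = [[l + 0.5 * g for l, g in zip(ls, gs)]
              for ls, gs in zip(logits, grad)]
```

Only the policy is trained here. The baseline costs one extra (greedy) generation and one extra reward evaluation per prompt, which is cheap when simulation is fast, and the whole response shares a single scalar advantage because the reward is trajectory-level.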
Related papers
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning (2025)
- Reinforcement Learning Fine-tunes A Sparse Subnetwork In Large Language Models (2025)
- Response-level Rewards Are All You Need For Online Reinforcement Learning In LLMs: A Mathematical Perspective (2025)
- Value Augmented Sampling For Language Model Alignment And Personalization (2024)
- The Alignment Ceiling: Objective Mismatch In Reinforcement Learning From Human Feedback (2023)
- DISPO: Enhancing Training Efficiency And Stability In Reinforcement Learning For Large Language Model Mathematical Reasoning (2026)
- End-to-end Optimization Of LLM-driven Multi-agent Search Systems Via Heterogeneous-group-based Reinforcement Learning (2025)