Proximal Supervised Fine-tuning
2025 · Wenhong Zhu, Ruobing Xie, Rui Wang, et al.
Abstract
Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT). This fine-tuning objective incorporates the benefits of a trust region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT, which stabilizes optimization and improves generalization while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for the subsequent optimizat
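The abstract describes PSFT as SFT viewed through a policy-gradient lens with a constant positive advantage, constrained in the spirit of PPO. The following is a minimal sketch of what such an objective could look like, assuming the standard PPO clipped surrogate with the advantage fixed to 1. The function name `psft_style_loss`, the clip range `eps=0.2`, and the per-token averaging are illustrative assumptions, not details taken from the paper, whose exact objective may differ.

```python
import math

def psft_style_loss(logp_new, logp_old, eps=0.2):
    """Hypothetical PPO-style clipped surrogate with advantage fixed to 1.

    logp_new: per-token log-probabilities of the target tokens under the
              policy being trained.
    logp_old: the same quantities under the reference ("old") policy.
    eps:      clip range (assumed value, as in common PPO setups).
    """
    total = 0.0
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)                        # importance ratio
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # trust-region clip
        # With advantage = 1, the PPO surrogate is min(ratio, clipped):
        # gains from pushing the ratio above 1 + eps are cut off.
        total += min(ratio, clipped)
    return -total / len(logp_new)                        # negate to minimize
```

Because the advantage is positive, the clip only binds from above: once a token's ratio exceeds `1 + eps`, the surrogate is flat and contributes no gradient, which is one way to read the abstract's "constraining policy drift".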
Related papers
- Simple Policy Optimization (2024)
- Truly Proximal Policy Optimization (2019)
- Proximal Policy Optimization Algorithms (2017)
- GFT: From Imitation To Reward Fine-tuning With Unbiased Group Advantages And Dynamic Coefficient Rectification (2026)
- Neural Proximal/trust Region Policy Optimization Attains Globally Optimal Policy (2019)
- It's Not You, It's Clipping: A Soft Trust-region Via Probability Smoothing For LLM RL (2025)
- SAFE: Stable Alignment Finetuning With Entropy-aware Predictive Control For Reinforcement Learning From Human Feedback (RLHF) (2026)
- Policy Optimization With Penalized Point Probability Distance: An Alternative To Proximal Policy Optimization (2018)