GFT: From Imitation To Reward Fine-tuning With Unbiased Group Advantages And Dynamic Coefficient Rectification
2026 · Wangjie Gan, Miao Pan, Linbo Xi, et al.
Abstract
arXiv:2604.14258v2
Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to sta
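The abstract's reading of SFT as a policy gradient can be checked numerically on a toy categorical policy. The sketch below is an illustration under stated assumptions, not the authors' code. It compares the gradient of the SFT loss, -log pi(y*), with the exact policy gradient under the implicit reward 1[y = y*] / pi(y). The helpers `group_advantages` and `bounded_inverse_weight` are hypothetical stand-ins for the two mechanisms named in the abstract: they use plain mean/std normalization and a fixed cap, which may differ from GFT's actual formulation.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def sft_grad(logits, target):
    """Gradient of the SFT loss -log pi(target) w.r.t. the logits."""
    g = softmax(logits)
    g[target] -= 1.0  # pi - onehot(target)
    return g


def pg_grad(logits, target):
    """Exact policy gradient of -E_{y~pi}[r(y) * log pi(y)], r held constant.

    The implicit reward is r(y) = 1[y == target] / pi(y): zero for every
    response except the reference one, and scaled by an inverse probability.
    """
    pi = softmax(logits)
    g = np.zeros_like(logits)
    for y in range(len(logits)):
        reward = 1.0 / pi[y] if y == target else 0.0
        grad_log_pi = -pi.copy()
        grad_log_pi[y] += 1.0  # d log pi(y) / d logits = onehot(y) - pi
        g -= pi[y] * reward * grad_log_pi
    return g


def group_advantages(rewards, eps=1e-8):
    """Hypothetical helper: mean/std-normalized advantages within one group.

    Assumption: standard group normalization, used here only to show how a
    group of sampled responses yields a dense, contrastive signal.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def bounded_inverse_weight(p, max_weight=10.0):
    """Hypothetical helper: cap the 1/pi weight at a fixed max_weight.

    Assumption: a fixed cap stands in for the adaptive bound the abstract
    describes; the paper's actual rule is not reproduced here.
    """
    return np.minimum(1.0 / np.asarray(p, dtype=float), max_weight)


logits = np.array([2.0, 0.5, -1.0, -3.0])
target = 3  # a low-probability reference response

# The two gradients coincide, which is the "special case" reading of SFT.
same = np.allclose(sft_grad(logits, target), pg_grad(logits, target))

# The implicit weight 1/pi grows without bound as pi(target) shrinks.
raw_weight = 1.0 / softmax(logits)[target]
capped_weight = bounded_inverse_weight(softmax(logits)[target])

# One correct response among four samples gives every sample a nonzero signal.
adv = group_advantages([1.0, 0.0, 0.0, 0.0])
```

In this toy setting the reward is nonzero for a single response, and its weight is large exactly when the policy assigns that response low probability, which is consistent with the sparsity and instability the abstract attributes to SFT.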
Related papers
- GIFT: Global Stabilisation Via Intrinsic Fine Tuning (2026)
- DGPO: Distribution Guided Policy Optimization For Fine Grained Credit Assignment (2026)
- GIFT: Group-relative Implicit Fine-tuning Integrates GRPO With DPO And UNA (2025)
- Proximal Supervised Fine-tuning (2025)
- Provably Mitigating Overoptimization In RLHF: Your SFT Loss Is Implicitly An Adversarial Regularizer (2024)
- TIC-GRPO: Provable And Efficient Optimization For Reinforcement Learning From Human Feedback (2025)
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)
- Polychromic Objectives For Reinforcement Learning (2026)