ReCode: Reinforcing Code Generation with Reasoning-Process Rewards
2026 · Lishui Fan, Yu Zhang, Mouxiang Chen, et al.
Abstract
arXiv:2508.05170v3

In practice, rigorous reasoning is often a key driver of correct code, yet Reinforcement Learning (RL) for code generation often neglects to optimize reasoning quality. Bringing process-level supervision into RL is appealing, but it faces two challenges. First, training reliable reward models to assess reasoning quality is bottlenecked by the scarcity of fine-grained preference data. Second, naively incorporating such neural rewards may lead to reward hacking. This work proposes ReCode (Reasoning-Reinforced Code Generation), a novel RL training framework comprising: (1) Contrastive Reasoning-Process Reward Learning (CRPL), which trains a reward model on synthesized optimized and degraded reasoning variants to assess the quality of the reasoning process; and (2) Consistency-Gated GRPO (CG-GRPO), which integrates the reasoning-process reward model into RL by gating neural reasoning-process rewards with strict execution outcome
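The abstract breaks off before it specifies the gating rule, so the following is only a minimal sketch of one plausible reading: the neural reasoning-process score counts toward the reward only when the generated code passes execution, and rewards are then standardized within each sampled group, as in GRPO. The function names, the `weight` parameter, the binary gate, and the sample values are all illustrative assumptions, not the authors' formulation.

```python
from statistics import mean, pstdev


def gated_rewards(passed, process_scores, weight=0.5):
    """Combine execution outcomes with reasoning-process scores.

    Assumed gating rule (hypothetical): the process score is added
    only when the sample passed its tests, so a well-scored reasoning
    trace cannot compensate for failing code.
    """
    rewards = []
    for ok, score in zip(passed, process_scores):
        outcome = 1.0 if ok else 0.0
        gate = 1.0 if ok else 0.0
        rewards.append(outcome + weight * gate * score)
    return rewards


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group.

    Population standard deviation is used here; implementations differ
    on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt (made-up values).
passed = [True, True, False, False]
process_scores = [0.9, 0.2, 0.8, 0.1]

rewards = gated_rewards(passed, process_scores)
advantages = group_advantages(rewards)
```

In this sketch the third sample has a high reasoning score but fails execution, so its reward stays at zero. That is the kind of reward hacking the gate is meant to block under the assumption above.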
Related papers
- The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards (2026)
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning in GRPO (2025)
- Policy Improvement Reinforcement Learning (2026)
- Scheduling Your LLM Reinforcement Learning with Reasoning Trees (2026)
- PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning (2026)
- Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning (2026)
- Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis (2026)
- RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner (2024)