Inpainting-guided Policy Optimization For Diffusion Large Language Models
2025 Β· Siyan Zhao, Mengchen Liu, Jing Huang, et al.
Abstract
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity--their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO
Authors
(none)
Tags
Stats
Related papers
- Simple Policy Gradients For Reasoning With Diffusion Language Models (2025)0.00
- Stabilizing Reinforcement Learning For Diffusion Language Models (2026)0.00
- SPG: Sandwiched Policy Gradient For Masked Diffusion Language Models (2025)0.00
- GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning (2025)0.00
- Diffpo: Training Diffusion Llms To Reason Fast And Furious Via Reinforcement Learning (2025)0.00
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)0.00
- Genpo: Generative Diffusion Models Meet On-policy Reinforcement Learning (2025)0.00
- Llm-explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven By Large Language Models (2025)0.00