Self-induced Outcome Potential: Turn-level Credit Assignment For Agents Without Verifiers
2026 · Senkang Hu, Yong Dai, Xudong Han, et al.
Abstract
arXiv:2605.04984v1. Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability-aware target distribution over these states. It then rewards turns for increasing
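The pipeline described in the abstract (sample rollouts, cluster final answers, build a target distribution, reward turns by a change in potential) can be illustrated with a rough sketch. This is not the authors' implementation. The abstract is cut off before it states which quantity a turn is rewarded for increasing, so everything below is an assumption made for illustration: `normalize` is a hypothetical stand-in for semantic clustering, plain empirical frequencies stand in for the reliability-aware target, and `potential` is an assumed overlap score between a per-turn belief over clusters and that target. Only the shaping form, reward equals discounted next potential minus current potential, is the standard potential-based construction.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Hypothetical stand-in for semantic clustering: answers that match
    # after lowercasing and whitespace collapsing share one cluster.
    return " ".join(answer.lower().split())


def outcome_distribution(final_answers):
    # Cluster the sampled final answers and return empirical cluster
    # frequencies (an assumed substitute for the reliability-aware target).
    counts = Counter(normalize(a) for a in final_answers)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


def potential(belief, target):
    # Assumed potential: overlap between the belief over outcome clusters
    # held at a given turn and the target distribution.
    return sum(belief.get(c, 0.0) * p for c, p in target.items())


def turn_rewards(beliefs, target, gamma=1.0):
    # Potential-based shaping: r_t = gamma * Phi(s_{t+1}) - Phi(s_t).
    phis = [potential(b, target) for b in beliefs]
    return [gamma * phis[t + 1] - phis[t] for t in range(len(phis) - 1)]


# Four sampled rollouts for one query; three land in the same cluster.
answers = ["Paris", "paris ", "Lyon", "Paris"]
target = outcome_distribution(answers)

# Hypothetical per-turn beliefs over clusters: before and after one turn.
beliefs = [{"paris": 0.5, "lyon": 0.5}, {"paris": 1.0}]
rewards = turn_rewards(beliefs, target)
```

With `gamma=1.0` the per-turn rewards telescope to the difference between the last and first potentials, which is why potential-based shaping redistributes credit across turns without changing the total a trajectory receives.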
Related papers
- MICA: Multi-granularity Intertemporal Credit Assignment For Long-horizon Emotional Support Dialogue (2026)
- SCRIBE: Structured Mid-level Supervision For Tool-using Language Models (2026)
- Turn-PPO: Turn-level Advantage Estimation With PPO For Improved Multi-turn RL In Agentic LLMs (2025)
- Rollout Pass-rate Control: Steering Binary-reward RL Toward Its Most Informative Regime (2026)
- DGPO: Distribution Guided Policy Optimization For Fine Grained Credit Assignment (2026)
- Counterfactual Credit Assignment In Model-free Reinforcement Learning (2020)
- Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization (2025)
- QLLM: Do We Really Need A Mixing Network For Credit Assignment In Multi-agent Reinforcement Learning? (2025)