Rewarding The Scientific Process: Process-level Reward Modeling For Agentic Data Analysis
2026 · Zhisong Qiu, Shuofei Qiao, Kewei Xu, et al.
Abstract
arXiv:2604.24198v1
Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors (logical flaws that yield incorrect results without triggering interpreter exceptions) and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strateg
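To make the notion of a silent error concrete, here is a minimal Python sketch. It is an illustration only and not taken from the paper: the data, the `-1.0` sentinel convention, and the function names `naive_mean`, `checked_mean`, and `probe_has_sentinel` are all hypothetical. The naive step runs without any interpreter exception yet returns a wrong aggregate, and a small probe of the intermediate state is enough to expose the problem.

```python
from statistics import mean

# Hypothetical sensor readings; -1.0 is an assumed sentinel meaning "missing".
readings = [12.0, 15.0, -1.0, 14.0, -1.0, 13.0]


def naive_mean(values):
    # Silent error: no exception is raised, but the sentinels are
    # averaged in as if they were real measurements.
    return mean(values)


def checked_mean(values, sentinel=-1.0):
    # Drop sentinel entries before aggregating.
    valid = [v for v in values if v != sentinel]
    return mean(valid)


def probe_has_sentinel(values, sentinel=-1.0):
    # Verifier-style probe: inspect the intermediate data instead of
    # trusting that "no exception" means "correct result".
    return any(v == sentinel for v in values)


if __name__ == "__main__":
    print("naive:", naive_mean(readings))
    print("checked:", checked_mean(readings))
    print("sentinel present:", probe_has_sentinel(readings))
```

A check that only watches for exceptions would accept the naive step. Inspecting the intermediate values is what reveals the flaw, which is the kind of probing the abstract attributes to an active verifier.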
Related papers
- Inferring Probabilistic Reward Machines From Non-markovian Reward Processes For Reinforcement Learning (2021)
- SCRIBE: Structured Mid-level Supervision For Tool-using Language Models (2026)
- Recode: Reinforcing Code Generation With Reasoning-process Rewards (2026)
- Adapt To Thrive! Adaptive Power-mean Policy Optimization For Improved LLM Reasoning (2026)
- Aligning Agents Via Planning: A Benchmark For Trajectory-level Reward Modeling (2026)
- Abstract Reward Processes: Leveraging State Abstraction For Consistent Off-policy Evaluation (2024)
- Learning Human Rewards By Inferring Their Latent Intelligence Levels In Multi-agent Games: A Theory-of-mind Approach With Application To Driving Data (2021)
- Scalable Agent Alignment Via Reward Modeling: A Research Direction (2018)