Aligning Agents Via Planning: A Benchmark For Trajectory-level Reward Modeling
2026 · Jiaxuan Wang, Yulan Hu, Wenjin Yang, et al.
Abstract
In classical Reinforcement Learning from Human Feedback (RLHF), Reward Models (RMs) serve as the fundamental signal provider for model alignment. As Large Language Models evolve into agentic systems capable of autonomous tool invocation and complex reasoning, the paradigm of reward modeling faces unprecedented challenges, most notably the lack of benchmarks specifically designed to assess RM capabilities within tool-integrated environments. To address this gap, we present Plan-RewardBench, a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred versus distractor agent trajectories in complex tool-using scenarios. Plan-RewardBench covers four representative task families: (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery. It comprises validated positive trajectories and confusable hard negatives constructed via multi-model natural rollouts, rule-based perturbations, and minimal
Related papers
- Scalable Agent Alignment Via Reward Modeling: A Research Direction (2018)
- Reward Hacking Benchmark: Measuring Exploits In LLM Agents With Tool Use (2026)
- REBEL: Reward Regularization-based Approach For Robotic Reinforcement Learning From Human Feedback (2023)
- Improving Multimodal Interactive Agents With Reinforcement Learning From Human Feedback (2022)
- Non-markovian Reward Modelling From Trajectory Labels Via Interpretable Multiple Instance Learning (2022)
- The Alignment Ceiling: Objective Mismatch In Reinforcement Learning From Human Feedback (2023)
- Learning Reward Functions For Cooperative Resilience In Multi-agent Systems (2026)
- SCRIBE: Structured Mid-level Supervision For Tool-using Language Models (2026)