Reward Hacking Benchmark: Measuring Exploits In LLM Agents With Tool Use
2026 · Kunvar Thaman
Abstract
arXiv:2605.02964v1
Reinforcement learning (RL) trained language model agents with tool access are increasingly deployed in coding assistants, research tools, and autonomous systems. We introduce the Reward Hacking Benchmark (RHB), a suite of multi-step tasks requiring sequential tool operations with naturalistic shortcut opportunities such as skipping verification steps, inferring answers from task-adjacent metadata, or tampering with evaluation-relevant functions. RHB supports independent and chained task regimes, where chain length acts as a proxy for longer-horizon agent behavior. We evaluate 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek. Exploit rates range from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek-R1-Zero), varying sharply by post-training style. A controlled sibling comparison (DeepSeek-V3 vs. DeepSeek-R1-Zero) shows RL post-training is associated with substantially higher reward hacking (0.6% vs. 13.9%), with consistent ga
Authors
Kunvar Thaman
Related papers
- The Effects Of Reward Misspecification: Mapping And Mitigating Misaligned Models (2022) 0.00
- Aligning Agents Via Planning: A Benchmark For Trajectory-level Reward Modeling (2026) 0.00
- SCRIBE: Structured Mid-level Supervision For Tool-using Language Models (2026) 0.00
- Correlated Proxies: A New Definition And Improved Mitigation For Reward Hacking (2024) 2.76
- Reward Shaping For Happier Autonomous Cyber Security Agents (2023) 9.23
- HackAtari: Atari Learning Environments For Robust And Continual Reinforcement Learning (2024) 0.00
- RLeXplore: Accelerating Research In Intrinsically-motivated Reinforcement Learning (2024) 5.33
- REBEL: Reward Regularization-based Approach For Robotic Reinforcement Learning From Human Feedback (2023) 0.00