Recover Triggered States: Protect Model Against Backdoor Attack In Reinforcement Learning
2023 · Hao Chen, Chen Gong, Yizhe Wang, et al.
Abstract
A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thus inserting a backdoor into the trained agent. Such attacks compromise the RL system's reliability, leading to potentially catastrophic results in various key fields. In contrast, relatively limited research has investigated effective defenses against backdoor attacks in RL. This paper proposes the Recovery Triggered States (RTS) method, a novel approach that effectively protects the victim agents from backdoor attacks. RTS involves building a surrogate network to approximate the dynamics model. Developers can then recover the environment from the triggered state to a clean state, thereby preventing attackers from activating backdoors hidden in the agent by presenting the trigger. When training the surrogate to predict states, we incorporate agent action information to reduce the discrepancy between the actions taken by the agent on predicted states and the actions taken on real sta
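A minimal sketch of the kind of training objective the abstract describes: a surrogate predicts the current state from the previous state and action, and its loss adds a term for the gap between the agent's actions on predicted and real states. Everything here is an assumption for illustration, not the paper's implementation: the function names, the linear surrogate and linear-softmax policy standing in for neural networks, the KL divergence as the action-discrepancy measure, and the `action_weight` parameter.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def policy_probs(policy_w, states):
    # Hypothetical frozen agent: linear-softmax policy over discrete actions.
    return softmax(states @ policy_w)


def surrogate_predict(surrogate_w, prev_states, prev_actions_onehot):
    # Hypothetical linear surrogate of the dynamics model:
    # predicts the current state from the previous state and action.
    x = np.concatenate([prev_states, prev_actions_onehot], axis=-1)
    return x @ surrogate_w


def surrogate_loss(surrogate_w, policy_w, prev_states, prev_actions_onehot,
                   real_states, action_weight=1.0):
    # State term: mean squared error between predicted and real states.
    pred = surrogate_predict(surrogate_w, prev_states, prev_actions_onehot)
    state_loss = np.mean((pred - real_states) ** 2)

    # Action term (assumed form): KL divergence between the agent's action
    # distribution on real states and on predicted states.
    eps = 1e-12
    p_real = policy_probs(policy_w, real_states)
    p_pred = policy_probs(policy_w, pred)
    kl = np.sum(p_real * (np.log(p_real + eps) - np.log(p_pred + eps)), axis=-1)
    action_loss = np.mean(kl)

    return state_loss + action_weight * action_loss, state_loss, action_loss
```

In this sketch the action term penalizes predictions that are close to the real state in squared error yet still change what the agent would do, which is the discrepancy the abstract says the action information is meant to reduce.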
Related papers
- Provable Defense Against Backdoor Policies In Reinforcement Learning (2022)
- Adversarial Inception Backdoor Attacks Against Reinforcement Learning (2024)
- Backdoor Attacks On Multiagent Collaborative Systems (2022)
- PolicyCleanse: Backdoor Detection And Mitigation In Reinforcement Learning (2022)
- Beware Untrusted Simulators -- Reward-free Backdoor Attacks In Reinforcement Learning (2026)
- Beyond Training-time Poisoning: Component-level And Post-training Backdoors In Deep Reinforcement Learning (2025)
- Cooperative Backdoor Attack In Decentralized Reinforcement Learning With Theoretical Guarantee (2024)
- A Spatiotemporal Stealthy Backdoor Attack Against Cooperative Multi-agent Deep Reinforcement Learning (2024)