RedAgent: An Autonomous Agent for Context-Aware Red Teaming of LLM Jailbreaks

Abstract

Large Language Models (LLMs) have recently been integrated into many real-world applications, such as Code Copilot. These applications have significantly expanded the attack surface of LLMs, exposing them to complex real-world jailbreak threats. Despite promising advances in red teaming (i.e., actively searching for jailbreak vulnerabilities of LLMs) in general contexts, identifying these threats in complex domain-specific contexts (e.g., mathematical LLMs) remains underexplored. In this paper, we study whether the context in which these real-world LLM applications operate, including their system prompts, tools, and task scenarios, gives rise to context-specific jailbreak threats. In particular, we adapt general jailbreak prompts to the context of the target application via LLM rewriting to generate context-specific attacks. By measuring the differences in jailbreak responses between general attacks and context-specific attacks, we reveal that customized domain-specific LLMs are more vulnerable within their specific context. Motivated by this observation, we propose RedAgent, a context-aware red teaming approach that generates context-specific jailbreak attacks against customized LLM applications. By retrieving and updating structured knowledge in an agent system, RedAgent efficiently perceives and utilizes contextual information to adapt jailbreak prompts to the target context. Extensive experiments show that our system can jailbreak most black-box LLMs within just five queries, a twofold improvement in efficiency over existing red teaming methods. Further, RedAgent can effectively jailbreak customized LLM applications. By generating context-specific jailbreak prompts against 60 trending applications on the marketplace of OpenAI, we discover 600 vulnerabilities in these real-world applications with only two queries per vulnerability.
