HarmBench

Name: HarmBench
License: MIT

Emerging

7 papers using it

8,603 HF downloads

46 HF likes

2024 first seen

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Paper: HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Data: Dataset

About: In this dataset card, we only use the behavior prompts proposed in HarmBench.

License: MIT

Citation: If you fin

🤗 Hugging Face · ⚖ MIT

Papers using HarmBench (7)

Furina: Fragmented Uncertainty-Driven Refusal Instability Attack (2026)

SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection (2025)

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features (2025)

Jailbreaking to Jailbreak (2025)

RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search (2025)

CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection (2025)

LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs (2024)