HarmBench
Emerging
7 papers using it
8,603 HF downloads
46 HF likes
2024 first seen
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Paper: HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Data: Dataset
About: In this dataset card, we only use the behavior prompts proposed in HarmBench.
License: MIT
Citation: If you fin
Hugging Face, MIT
Papers using HarmBench (7)
- Furina: Fragmented Uncertainty-Driven Refusal Instability Attack
- SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection
- CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features
- Jailbreaking to Jailbreak
- RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search
- CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection
- LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs