dataset reviewed open access llmsec-2025-00004

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks

2024-02, ICML 2024, 180 citations

Abstract

Introduces HarmBench, a standardized framework for evaluating automated red teaming methods and robust refusal in LLMs, organized around a comprehensive taxonomy of behaviors.

Framework Mappings

OWASP LLM: LLM01
NIST AI RMF: MEASURE

Cite This Resource

@inproceedings{llmsec202500004,
  title = {HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal},
  author = {Mantas Mazeika and Long Phan and Xuwang Yin and Andy Zou and Zifan Wang and Norman Mu and Elham Sakhaee and Nathaniel Li and Steven Basart and Bo Li and David Forsyth and Dan Hendrycks},
  year = {2024},
  booktitle = {ICML 2024},
  url = {https://arxiv.org/abs/2402.04249},
}

Metadata

Added: 2026-04-14
Added by: manual
Source: manual
arxiv_id: 2402.04249
