paper reviewed open access llmsec-2024-00001

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, Jacob Steinhardt

2024-01 — NeurIPS 2023 520 citations

Abstract

Analyzes failure modes of LLM safety training, identifying two broad categories: competing objectives and mismatched generalization, demonstrating attacks that exploit each.

Framework Mappings

OWASP LLM: LLM01 MITRE ATLAS: AML.T0054

Cite This Resource

@article{llmsec202400001,
  title = {Jailbroken: How Does LLM Safety Training Fail?},
  author = {Alexander Wei and Nika Haghtalab and Jacob Steinhardt},
  year = {2024},
  journal = {NeurIPS 2023},
  url = {https://arxiv.org/abs/2307.02483},
}

Metadata

Added: 2026-04-14
Added by: manual
Source: manual
arxiv_id: 2307.02483

Jailbroken: How Does LLM Safety Training Fail?

Abstract

Categories

Tags

Framework Mappings

Cite This Resource

Metadata