← Back to search
paper reviewed open access llmsec-2024-00001

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, Jacob Steinhardt

2024-01 — NeurIPS 2023 520 citations

Abstract

Analyzes failure modes of LLM safety training, identifying two broad categories: competing objectives and mismatched generalization, demonstrating attacks that exploit each.

Categories

Tags

safety-trainingalignmentfailure-modes

Framework Mappings

OWASP LLM: LLM01 MITRE ATLAS: AML.T0054

Cite This Resource

@article{llmsec202400001,
  title = {Jailbroken: How Does LLM Safety Training Fail?},
  author = {Alexander Wei and Nika Haghtalab and Jacob Steinhardt},
  year = {2024},
  journal = {NeurIPS 2023},
  url = {https://arxiv.org/abs/2307.02483},
}

Metadata

Added
2026-04-14
Added by
manual
Source
manual
arxiv_id
2307.02483