Paper · Reviewed · Open access · llmsec-2024-00001
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt
2023 · NeurIPS 2023 · 520 citations
Abstract
Analyzes the failure modes of LLM safety training, identifying two broad categories of vulnerability, competing objectives and mismatched generalization, and demonstrates attacks that exploit each.
Framework Mappings
OWASP LLM Top 10: LLM01 (Prompt Injection)
MITRE ATLAS: AML.T0054 (LLM Jailbreak)
Cite This Resource
@inproceedings{llmsec202400001,
  title     = {Jailbroken: How Does {LLM} Safety Training Fail?},
  author    = {Alexander Wei and Nika Haghtalab and Jacob Steinhardt},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2023},
  url       = {https://arxiv.org/abs/2307.02483},
}
Metadata
- Added: 2026-04-14
- Added by: manual
- Source: manual
- arxiv_id: 2307.02483