Paper · Reviewed · Open access · llmsec-2024-00002

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

2023-07 · arXiv preprint · 890 citations

Abstract

Proposes Greedy Coordinate Gradient (GCG), an automated method for generating adversarial suffixes that cause aligned LLMs to produce harmful content; the resulting attacks transfer across models, including ChatGPT and Claude.
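To illustrate the structure of the attack, here is a minimal, model-free sketch of a greedy coordinate-descent suffix search. The real GCG algorithm uses token-embedding gradients to shortlist candidate swaps and evaluates them against an LLM's loss on a target completion; this toy version brute-forces every candidate token against a stand-in loss function, and all names (`gcg_sketch`, the Hamming-distance loss, the tiny vocabulary) are hypothetical.

```python
def gcg_sketch(loss_fn, vocab, suffix, sweeps=5):
    """Toy greedy coordinate search over a token suffix.

    For each position, try every candidate token and keep the swap
    that most lowers the loss. The real GCG instead samples positions
    and uses gradients w.r.t. one-hot token embeddings to pick a small
    candidate set before evaluating batches against the model.
    """
    suffix = list(suffix)
    best = loss_fn(suffix)
    for _ in range(sweeps):
        for pos in range(len(suffix)):
            for tok in vocab:
                cand = suffix.copy()
                cand[pos] = tok
                cand_loss = loss_fn(cand)
                if cand_loss < best:
                    best, suffix = cand_loss, cand
    return "".join(suffix), best

# Stand-in loss: Hamming distance to a hypothetical "adversarial" target
# string, playing the role of the model's loss on the target completion.
target = "abcd"
loss = lambda s: sum(a != b for a, b in zip(s, target))

suffix, final_loss = gcg_sketch(loss, "abcdxyz", "xxxx")
# suffix converges to "abcd" with final_loss 0
```

The coordinate-wise structure is the point: each step changes exactly one suffix token, which is what makes the candidate scoring in the real attack batchable on a GPU.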

Tags

GCG · adversarial-suffix · transferability

Framework Mappings

OWASP LLM: LLM01
MITRE ATLAS: AML.T0043
MITRE ATLAS: AML.T0054

Cite This Resource

@article{llmsec202400002,
  title = {Universal and Transferable Adversarial Attacks on Aligned Language Models},
  author = {Andy Zou and Zifan Wang and Nicholas Carlini and Milad Nasr and J. Zico Kolter and Matt Fredrikson},
  year = {2023},
  journal = {arXiv preprint},
  url = {https://arxiv.org/abs/2307.15043},
}

Metadata

Added
2026-04-14
Added by
manual
Source
manual
arxiv_id
2307.15043