paper reviewed open access llmsec-2024-00028

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Jared Kaplan, Dario Amodei, Sam McCandlish, Ethan Perez

2024-01 — arXiv preprint 420 citations

View Resource PDF

Abstract

Demonstrates that LLMs can be trained with deceptive behaviors (sleeper agents) that persist through standard safety training including RLHF, posing risks for backdoor attacks.

Framework Mappings

OWASP LLM: LLM04 MITRE ATLAS: AML.T0018 MITRE ATLAS: AML.T0020

Cite This Resource

@article{llmsec202400028,
  title = {Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training},
  author = {Evan Hubinger and Carson Denison and Jesse Mu and Mike Lambert and Meg Tong and Monte MacDiarmid and Tamera Lanham and Daniel M. Ziegler and Tim Maxwell and Newton Cheng and Adam Jermyn and Amanda Askell and Ansh Radhakrishnan and Cem Anil and David Duvenaud and Deep Ganguli and Fazl Barez and Jack Clark and Kamal Ndousse and Kshitij Sachan and Michael Sellitto and Mrinank Sharma and Nova DasSarma and Roger Grosse and Shauna Kravec and Yuntao Bai and Jared Kaplan and Dario Amodei and Sam McCandlish and Ethan Perez},
  year = {2024},
  journal = {arXiv preprint},
  url = {https://arxiv.org/abs/2401.05566},
}

Metadata

Added: 2026-04-14
Added by: manual
Source: manual
arxiv_id: 2401.05566

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Abstract

Categories

Tags

Framework Mappings

Cite This Resource

Metadata