Paper · Reviewed · Open Access · llmsec-2024-00017

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tyre, Jared Kaplan, Chris Olah, Sam McCandlish, Dario Amodei

2022-08 · arXiv preprint · 750 citations

Abstract

Describes Anthropic's early red-teaming methodology for language models, documenting the methods used, observed scaling behaviors, and lessons learned in identifying harmful model outputs.

Tags

red-teaming · methodology · scaling

Framework Mappings

NIST AI RMF: MEASURE · NIST AI RMF: MANAGE

Cite This Resource

@article{llmsec202400017,
  title = {Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned},
  author = {Deep Ganguli and Liane Lovitt and Jackson Kernion and Amanda Askell and Yuntao Bai and Saurav Kadavath and Ben Mann and Ethan Perez and Nicholas Schiefer and Kamal Ndousse and Andy Jones and Sam Bowman and Anna Chen and Tom Conerly and Nova DasSarma and Dawn Drain and Nelson Elhage and Sheer El-Showk and Stanislav Fort and Zac Hatfield-Dodds and Tom Henighan and Danny Hernandez and Tristan Hume and Josh Jacobson and Scott Johnston and Shauna Kravec and Catherine Olsson and Sam Ringer and Eli Tyre and Jared Kaplan and Chris Olah and Sam McCandlish and Dario Amodei},
  year = {2022},
  journal = {arXiv preprint},
  url = {https://arxiv.org/abs/2209.07858},
}

Metadata

Added: 2026-04-14
Added by: manual
Source: manual
arXiv ID: 2209.07858