Paper · Reviewed · Open access · ID: llmsec-2024-00017
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tyre, Jared Kaplan, Chris Olah, Sam McCandlish, Dario Amodei
2022-08 · arXiv preprint · 750 citations
Abstract
Documents Anthropic's early red teaming of language models: crowdworkers attempted to elicit harmful outputs from models at three sizes (2.7B, 13B, and 52B parameters) and of four types (a plain LM, a prompted LM, rejection sampling, and an RLHF-trained model). The paper analyzes how attack success scales with model size and safety training, finding RLHF models increasingly difficult to red team as they scale, and releases the resulting dataset of roughly 39,000 red team attacks.
Framework Mappings
- NIST AI RMF: MEASURE
- NIST AI RMF: MANAGE
Cite This Resource
@article{llmsec202400017,
  title   = {Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned},
  author  = {Deep Ganguli and Liane Lovitt and Jackson Kernion and Amanda Askell and Yuntao Bai and Saurav Kadavath and Ben Mann and Ethan Perez and Nicholas Schiefer and Kamal Ndousse and Andy Jones and Sam Bowman and Anna Chen and Tom Conerly and Nova DasSarma and Dawn Drain and Nelson Elhage and Sheer El-Showk and Stanislav Fort and Zac Hatfield-Dodds and Tom Henighan and Danny Hernandez and Tristan Hume and Josh Jacobson and Scott Johnston and Shauna Kravec and Catherine Olsson and Sam Ringer and Eli Tyre and Jared Kaplan and Chris Olah and Sam McCandlish and Dario Amodei},
  year    = {2022},
  journal = {arXiv preprint arXiv:2209.07858},
  url     = {https://arxiv.org/abs/2209.07858},
}

Metadata
- Added: 2026-04-14
- Added by: manual
- Source: manual
- arxiv_id: 2209.07858
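
The arxiv_id above is enough to resolve this entry programmatically. Below is a minimal Python sketch assuming only the public arXiv export API (an Atom feed at export.arxiv.org) and the standard library; fetch_arxiv_metadata is an illustrative helper, not part of this site.

import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query?id_list={arxiv_id}"
ATOM = "{http://www.w3.org/2005/Atom}"  # namespace used by the arXiv Atom feed

def fetch_arxiv_metadata(arxiv_id: str) -> dict:
    """Fetch title, publication date, authors, and abstract for one arXiv ID."""
    with urllib.request.urlopen(ARXIV_API.format(arxiv_id=arxiv_id)) as resp:
        feed = ET.fromstring(resp.read())
    entry = feed.find(f"{ATOM}entry")  # single result for a one-ID query
    return {
        "title": " ".join(entry.findtext(f"{ATOM}title").split()),
        "published": entry.findtext(f"{ATOM}published"),
        "authors": [a.findtext(f"{ATOM}name")
                    for a in entry.findall(f"{ATOM}author")],
        "abstract": entry.findtext(f"{ATOM}summary").strip(),
    }

meta = fetch_arxiv_metadata("2209.07858")
print(meta["title"])                    # paper title as indexed on arXiv
print(len(meta["authors"]), "authors")  # should match the author list above

A fetch like this can be used to cross-check the hand-entered title, author list, and year in the BibTeX block against the arXiv record.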