Red Teaming

12 resources

Red Teaming & Evaluation

AI red team methodology, automation, and frameworks

paper reviewed open access 2024

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen + 1 more — ICLR 2024

Proposes AutoDAN, a method for automatically generating stealthy jailbreak prompts that are semantically meaningful and can bypass perplexity-based defenses.

jailbreaking adversarial examples red teaming 230 citations

paper reviewed open access 2024

PAIR: Prompt Automatic Iterative Refinement for Jailbreaking LLMs

Patrick Chao, Alexander Robey, Edgar Dobriban + 3 more — NeurIPS 2024

Uses an attacker LLM to automatically generate jailbreak prompts through iterative refinement achieving high success with only black-box access.

jailbreaking red teaming 220 citations

dataset reviewed open access 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin + 9 more — ICML 2024

Introduces HarmBench, a standardized framework for evaluating automated red teaming methods and robust refusal in LLMs with a comprehensive behavior taxonomy.

red teaming benchmarks jailbreaking 180 citations

paper reviewed open access 2024

Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subtrees

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik + 4 more — NeurIPS 2024

Introduces TAP using an LLM to iteratively refine jailbreak prompts against black-box target models with high success rates.

jailbreaking red teaming 175 citations

paper reviewed open access 2024

Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models

Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis + 2 more — arXiv preprint

Introduces CyberSecEval, a benchmark for evaluating the cybersecurity risks of LLM code generation, including insecure code suggestions.

benchmarks supply chain attacks red teaming 140 citations

paper reviewed open access 2024

StrongREJECT: A Comprehensive Evaluation of LLM Safety Refusal Behaviors

Alexandra Souly, Qingyuan Lu, Dillon Bowen + 8 more — arXiv preprint

Introduces StrongREJECT, a high-quality evaluation benchmark for measuring how well LLMs refuse harmful requests.

benchmarks guardrails red teaming 65 citations

paper reviewed open access 2024

Garak: A Framework for Security Probing Large Language Models

Leon Derczynski, Erick Galinkin, Jeffrey Martin + 2 more — arXiv preprint

Presents garak, an open-source framework for systematically probing LLM vulnerabilities including prompt injection, data leakage, and toxicity generation.

red teaming fuzzing benchmarks 40 citations

tool reviewed open access 2024

PyRIT: Python Risk Identification Toolkit for Generative AI

Microsoft AI Red Team — GitHub / Microsoft

Microsoft's open-source framework for red teaming generative AI systems, supporting automated prompt generation, attack strategies, and scoring of AI responses.

red teaming fuzzing tool use security

report reviewed open access 2024

Google DeepMind: Gemini Security Assessment and Red Team Evaluation

Google DeepMind — Google AI Blog

Describes Google's approach to red teaming Gemini models, including automated and human evaluation methods for safety and security.

red teaming benchmarks industry report

report reviewed open access 2024

Microsoft AI Red Team: Lessons Learned and Best Practices

Microsoft AI Red Team — Microsoft Blog

Shares lessons from Microsoft's AI red team operations including methodology, tooling, common failure modes, and best practices.

red teaming industry report

report reviewed open access 2023

OpenAI: Preparedness Framework (Beta)

OpenAI — OpenAI Blog

OpenAI's approach to tracking, evaluating, forecasting, and protecting against catastrophic risks of frontier AI models.

risk frameworks red teaming model governance

paper reviewed open access 2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion + 30 more — arXiv preprint

Describes Anthropic's early red teaming methodology for language models, documenting methods, scaling behaviors, and lessons for identifying harmful outputs.

red teaming benchmarks 750 citations