Red Teaming
12 resourcesRed Teaming & Evaluation
AI red team methodology, automation, and frameworks
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen + 1 more — ICLR 2024
Proposes AutoDAN, a method for automatically generating stealthy jailbreak prompts that are semantically meaningful and can bypass perplexity-based defenses.
PAIR: Prompt Automatic Iterative Refinement for Jailbreaking LLMs
Patrick Chao, Alexander Robey, Edgar Dobriban + 3 more — NeurIPS 2024
Uses an attacker LLM to automatically generate jailbreak prompts through iterative refinement achieving high success with only black-box access.
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin + 9 more — ICML 2024
Introduces HarmBench, a standardized framework for evaluating automated red teaming methods and robust refusal in LLMs with a comprehensive behavior taxonomy.
Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subtrees
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik + 4 more — NeurIPS 2024
Introduces TAP using an LLM to iteratively refine jailbreak prompts against black-box target models with high success rates.
Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis + 2 more — arXiv preprint
Introduces CyberSecEval, a benchmark for evaluating the cybersecurity risks of LLM code generation, including insecure code suggestions.
StrongREJECT: A Comprehensive Evaluation of LLM Safety Refusal Behaviors
Alexandra Souly, Qingyuan Lu, Dillon Bowen + 8 more — arXiv preprint
Introduces StrongREJECT, a high-quality evaluation benchmark for measuring how well LLMs refuse harmful requests.
Garak: A Framework for Security Probing Large Language Models
Leon Derczynski, Erick Galinkin, Jeffrey Martin + 2 more — arXiv preprint
Presents garak, an open-source framework for systematically probing LLM vulnerabilities including prompt injection, data leakage, and toxicity generation.
PyRIT: Python Risk Identification Toolkit for Generative AI
Microsoft AI Red Team — GitHub / Microsoft
Microsoft's open-source framework for red teaming generative AI systems, supporting automated prompt generation, attack strategies, and scoring of AI responses.
Google DeepMind: Gemini Security Assessment and Red Team Evaluation
Google DeepMind — Google AI Blog
Describes Google's approach to red teaming Gemini models, including automated and human evaluation methods for safety and security.
Microsoft AI Red Team: Lessons Learned and Best Practices
Microsoft AI Red Team — Microsoft Blog
Shares lessons from Microsoft's AI red team operations including methodology, tooling, common failure modes, and best practices.
OpenAI: Preparedness Framework (Beta)
OpenAI — OpenAI Blog
OpenAI's approach to tracking, evaluating, forecasting, and protecting against catastrophic risks of frontier AI models.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion + 30 more — arXiv preprint
Describes Anthropic's early red teaming methodology for language models, documenting methods, scaling behaviors, and lessons for identifying harmful outputs.