← Back to all categories

Red Teaming

12 resources

Red Teaming & Evaluation

AI red team methodology, automation, and frameworks

paper reviewed open access 2024

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen + 1 more — ICLR 2024

Proposes AutoDAN, a method for automatically generating stealthy jailbreak prompts that are semantically meaningful and can bypass perplexity-based defenses.

paper reviewed open access 2024

PAIR: Prompt Automatic Iterative Refinement for Jailbreaking LLMs

Patrick Chao, Alexander Robey, Edgar Dobriban + 3 more — NeurIPS 2024

Uses an attacker LLM to automatically generate jailbreak prompts through iterative refinement achieving high success with only black-box access.

dataset reviewed open access 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin + 9 more — ICML 2024

Introduces HarmBench, a standardized framework for evaluating automated red teaming methods and robust refusal in LLMs with a comprehensive behavior taxonomy.

paper reviewed open access 2024

Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subtrees

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik + 4 more — NeurIPS 2024

Introduces TAP using an LLM to iteratively refine jailbreak prompts against black-box target models with high success rates.

paper reviewed open access 2024

Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models

Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis + 2 more — arXiv preprint

Introduces CyberSecEval, a benchmark for evaluating the cybersecurity risks of LLM code generation, including insecure code suggestions.

paper reviewed open access 2024

StrongREJECT: A Comprehensive Evaluation of LLM Safety Refusal Behaviors

Alexandra Souly, Qingyuan Lu, Dillon Bowen + 8 more — arXiv preprint

Introduces StrongREJECT, a high-quality evaluation benchmark for measuring how well LLMs refuse harmful requests.

paper reviewed open access 2024

Garak: A Framework for Security Probing Large Language Models

Leon Derczynski, Erick Galinkin, Jeffrey Martin + 2 more — arXiv preprint

Presents garak, an open-source framework for systematically probing LLM vulnerabilities including prompt injection, data leakage, and toxicity generation.

tool reviewed open access 2024

PyRIT: Python Risk Identification Toolkit for Generative AI

Microsoft AI Red Team — GitHub / Microsoft

Microsoft's open-source framework for red teaming generative AI systems, supporting automated prompt generation, attack strategies, and scoring of AI responses.

report reviewed open access 2024

Google DeepMind: Gemini Security Assessment and Red Team Evaluation

Google DeepMind — Google AI Blog

Describes Google's approach to red teaming Gemini models, including automated and human evaluation methods for safety and security.

report reviewed open access 2024

Microsoft AI Red Team: Lessons Learned and Best Practices

Microsoft AI Red Team — Microsoft Blog

Shares lessons from Microsoft's AI red team operations including methodology, tooling, common failure modes, and best practices.

report reviewed open access 2023

OpenAI: Preparedness Framework (Beta)

OpenAI — OpenAI Blog

OpenAI's approach to tracking, evaluating, forecasting, and protecting against catastrophic risks of frontier AI models.

paper reviewed open access 2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion + 30 more — arXiv preprint

Describes Anthropic's early red teaming methodology for language models, documenting methods, scaling behaviors, and lessons for identifying harmful outputs.