Jailbreaking
17 resources · Attacks & Threats
Guardrail bypass and alignment subversion techniques
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt — NeurIPS 2023
Analyzes failure modes of LLM safety training, identifying two broad categories (competing objectives and mismatched generalization) and demonstrating attacks that exploit each.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen + 1 more — ICLR 2024
Proposes AutoDAN, a method for automatically generating stealthy jailbreak prompts that are semantically meaningful and can bypass perplexity-based defenses.
PAIR: Prompt Automatic Iterative Refinement for Jailbreaking LLMs
Patrick Chao, Alexander Robey, Edgar Dobriban + 3 more — NeurIPS 2024
Uses an attacker LLM to automatically generate jailbreak prompts through iterative refinement, achieving high success rates with only black-box access to the target.
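As a rough illustration of the refinement loop described here (not the authors' code), a minimal sketch in which the attacker, target, and judge calls are hypothetical stubs:

```python
# Minimal sketch of a PAIR-style refinement loop. The three model calls are
# hypothetical stubs standing in for whatever attacker, target, and judge
# models an evaluator plugs in; the actual method is defined in the paper.

def attacker_llm(goal: str, history: list) -> str:
    """Stub: proposes a candidate prompt for `goal`, given prior feedback."""
    return f"[candidate prompt for: {goal}]"

def target_llm(prompt: str) -> str:
    """Stub: black-box call to the model under test."""
    return "[target response]"

def judge_llm(goal: str, response: str) -> int:
    """Stub: rates how fully the response satisfies the goal (1-10)."""
    return 1

def pair_attack(goal: str, max_iters: int = 20, threshold: int = 10):
    history = []
    for _ in range(max_iters):
        prompt = attacker_llm(goal, history)       # attacker proposes a prompt
        response = target_llm(prompt)              # query the black-box target
        score = judge_llm(goal, response)          # judge rates the attempt
        if score >= threshold:                     # stop once the judge says it worked
            return prompt, response
        history.append((prompt, response, score))  # feed the outcome back to the attacker
    return None
```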
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda + 3 more — AAAI 2024
Shows adversarial images can jailbreak multimodal LLMs that are robust to text-only attacks, bypassing alignment through the visual channel.
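The attack operates on the image channel rather than the text channel; as a rough sketch of that shape (assumptions: a projected-gradient loop with a stubbed gradient, not the paper's code), see below:

```python
import numpy as np

# Sketch of an image-channel attack: small gradient steps keep an adversarial
# image within an epsilon-ball of a benign one. The gradient here is a random
# stub; the paper computes it through the multimodal model's loss on a target
# continuation.

def loss_gradient(image: np.ndarray) -> np.ndarray:
    """Stub: would be d(loss)/d(image) from the target multimodal model."""
    return np.random.randn(*image.shape)

def pgd_attack(image: np.ndarray, eps: float = 16 / 255, alpha: float = 1 / 255,
               steps: int = 100) -> np.ndarray:
    adv = image.copy()
    for _ in range(steps):
        grad = loss_gradient(adv)
        adv = adv - alpha * np.sign(grad)             # signed gradient step
        adv = np.clip(adv, image - eps, image + eps)  # stay within the L-infinity ball
        adv = np.clip(adv, 0.0, 1.0)                  # keep pixels in valid range
    return adv
```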
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin + 9 more — ICML 2024
Introduces HarmBench, a standardized framework for evaluating automated red teaming methods and robust refusal in LLMs with a comprehensive behavior taxonomy.
Are Aligned Neural Networks Adversarially Aligned?
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo + 8 more — NeurIPS 2023
Evaluates whether multimodal LLMs aligned to refuse harmful text requests also refuse harmful image-based requests, finding significant gaps.
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik + 4 more — NeurIPS 2024
Introduces TAP (Tree of Attacks with Pruning), which uses an attacker LLM to iteratively refine and prune candidate jailbreak prompts in a tree structure, achieving high success rates against black-box target models.
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang + 4 more — ICLR 2024
Demonstrates that LLMs can be jailbroken using cipher-based encoding, bypassing safety training designed for natural language.
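The core idea is that the request is shifted out of natural language before it reaches the model; a minimal sketch with a simple Caesar shift and a benign placeholder request (the paper evaluates several encodings, and this is not its code):

```python
# Sketch of the cipher-encoding idea: the text is moved out of natural
# language, so safety training tuned on plain English never sees the request
# in the form it was trained on. Caesar shift shown here as one example.

def caesar_encode(text: str, shift: int = 3) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# A benign placeholder request, encoded the way a cipher-style prompt would be.
print(caesar_encode("Please summarize the attached document."))
# -> "Sohdvh vxppdulch wkh dwwdfkhg grfxphqw."
```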
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
Leo Schwinn, David Dobre, Stephan Günnemann + 1 more — arXiv preprint
Systematizes adversarial attacks and defenses for LLMs, connecting them to the classical adversarial ML literature while identifying LLM-specific threats.
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety
Yi Zeng, Hongpeng Lin, Jingwen Zhang + 3 more — ACL 2024
Applies social science persuasion techniques to jailbreak LLMs, using a taxonomy of persuasion strategies to craft prompts that achieve high attack success rates.
Adaptive Attacks Break Defenses Against LLM Jailbreaking
Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint
Shows that adaptive adversaries can bypass most proposed jailbreak defenses, highlighting the arms race between attacks and defenses.
Anthropic: Many-shot Jailbreaking
Anthropic — Anthropic Research Blog
Reveals many-shot jailbreaking, a technique exploiting long context windows by including many examples of harmful Q&A pairs to override safety training.
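A minimal sketch of the prompt structure (placeholder content only; per the post, the effect comes from packing hundreds of faux dialogue turns into a long context window):

```python
# Sketch of the many-shot prompt structure: a long run of fabricated
# question/answer turns placed before the final query. Placeholder text only.

def build_many_shot_prompt(shots: list[tuple[str, str]], final_question: str) -> str:
    turns = [f"User: {q}\nAssistant: {a}" for q, a in shots]
    turns.append(f"User: {final_question}\nAssistant:")
    return "\n\n".join(turns)

# With a very long context window, an attacker can include hundreds of shots.
placeholder_shots = [("<question i>", "<compliant answer i>")] * 256
prompt = build_many_shot_prompt(placeholder_shots, "<final question>")
print(prompt.count("User:"))  # 257 in-context turns before the real query
```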
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini + 3 more — arXiv preprint
Proposes an automated method (GCG) to generate adversarial suffixes that cause aligned LLMs to produce harmful content, with attacks transferring across models including ChatGPT and Claude.
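A schematic of what a greedy coordinate search over suffix tokens looks like, under stated assumptions: the scoring function below is a random stub standing in for the model loss, and the real GCG ranks candidate substitutions using gradients with respect to one-hot token embeddings, which this sketch omits.

```python
import random

# Schematic of a GCG-style search loop: keep a suffix of adversarial tokens
# and greedily swap one position at a time toward a lower target loss.

VOCAB = [f"tok{i}" for i in range(1000)]   # placeholder vocabulary

def suffix_loss(suffix: list[str]) -> float:
    """Stub: would be the target model's loss on producing the desired output."""
    return random.random()

def gcg_search(suffix_len: int = 20, steps: int = 500, candidates: int = 64):
    suffix = [random.choice(VOCAB) for _ in range(suffix_len)]
    best = suffix_loss(suffix)
    for _ in range(steps):
        pos = random.randrange(suffix_len)             # coordinate to modify
        for tok in random.sample(VOCAB, candidates):   # candidate substitutions
            trial = suffix[:pos] + [tok] + suffix[pos + 1:]
            loss = suffix_loss(trial)
            if loss < best:                            # keep the best swap found
                best, suffix = loss, trial
    return suffix, best
```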
Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Xinyue Shen, Zeyuan Chen, Michael Backes + 2 more — CCS 2024
Collects and analyzes 6,387 jailbreak prompts from the wild, developing a comprehensive taxonomy of jailbreak techniques and evaluating their effectiveness.
LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat
Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish — arXiv preprint
Shows that LoRA fine-tuning with as few as 100 examples can remove safety guardrails from Llama 2-Chat, raising concerns about fine-tuning access to aligned models.
Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs through a Global Scale Prompt Hacking Competition
Sander Schulhoff, Jeremy Pinto, Anaum Khan + 7 more — EMNLP 2023
Presents results from a global prompt hacking competition with 600K+ adversarial prompts, revealing systemic LLM vulnerabilities across multiple models and defense strategies.
Multi-step Jailbreaking Privacy Attacks on ChatGPT
Haoran Li, Dadi Guo, Wei Fan + 4 more — EMNLP 2023 Findings
Demonstrates multi-step jailbreaking attacks to extract personal information from ChatGPT, showing how sequential prompting can bypass safety measures.