Jailbreaking
17 resources · Attacks & Threats
Guardrail bypass and alignment subversion techniques
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt — NeurIPS 2023
Analyzes failure modes of LLM safety training, identifying two broad categories (competing objectives and mismatched generalization) and demonstrating attacks that exploit each.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen + 1 more — ICLR 2024
Proposes AutoDAN, a method for automatically generating stealthy jailbreak prompts that are semantically meaningful and can bypass perplexity-based defenses.
PAIR: Prompt Automatic Iterative Refinement for Jailbreaking LLMs
Patrick Chao, Alexander Robey, Edgar Dobriban + 3 more — NeurIPS 2024
Uses an attacker LLM to automatically generate jailbreak prompts through iterative refinement, achieving high success rates with only black-box access to the target.
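As a rough illustration of the refinement loop described here (not the authors' code), a minimal sketch in which the attacker, target, and judge calls are hypothetical stubs:

```python
# Minimal sketch of a PAIR-style refinement loop. The three model calls are
# hypothetical stubs standing in for whatever attacker, target, and judge
# models an evaluator plugs in; the actual method is defined in the paper.

def attacker_llm(goal: str, history: list) -> str:
    """Stub: proposes a candidate prompt for `goal`, given prior feedback."""
    return f"[candidate prompt for: {goal}]"

def target_llm(prompt: str) -> str:
    """Stub: black-box call to the model under test."""
    return "[target response]"

def judge_llm(goal: str, response: str) -> int:
    """Stub: rates how fully the response satisfies the goal (1-10)."""
    return 1

def pair_attack(goal: str, max_iters: int = 20, threshold: int = 10):
    history = []
    for _ in range(max_iters):
        prompt = attacker_llm(goal, history)       # attacker proposes a prompt
        response = target_llm(prompt)              # query the black-box target
        score = judge_llm(goal, response)          # judge rates the attempt
        if score >= threshold:                     # stop once the judge says it worked
            return prompt, response
        history.append((prompt, response, score))  # feed the outcome back to the attacker
    return None
```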
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda + 3 more — AAAI 2024
Shows adversarial images can jailbreak multimodal LLMs that are robust to text-only attacks, bypassing alignment through the visual channel.
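The attack operates on the image channel rather than the text channel; as a rough sketch of that shape (assumptions: a projected-gradient loop with a stubbed gradient, not the paper's code), see below:

```python
import numpy as np

# Sketch of an image-channel attack: small gradient steps keep an adversarial
# image within an epsilon-ball of a benign one. The gradient here is a random
# stub; the paper computes it through the multimodal model's loss on a target
# continuation.

def loss_gradient(image: np.ndarray) -> np.ndarray:
    """Stub: would be d(loss)/d(image) from the target multimodal model."""
    return np.random.randn(*image.shape)

def pgd_attack(image: np.ndarray, eps: float = 16 / 255, alpha: float = 1 / 255,
               steps: int = 100) -> np.ndarray:
    adv = image.copy()
    for _ in range(steps):
        grad = loss_gradient(adv)
        adv = adv - alpha * np.sign(grad)             # signed gradient step
        adv = np.clip(adv, image - eps, image + eps)  # stay within the L-infinity ball
        adv = np.clip(adv, 0.0, 1.0)                  # keep pixels in valid range
    return adv
```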
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin + 9 more — ICML 2024
Introduces HarmBench, a standardized framework for evaluating automated red teaming methods and robust refusal in LLMs with a comprehensive behavior taxonomy.
Are Aligned Neural Networks Adversarially Aligned?
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo + 8 more — NeurIPS 2023
Evaluates whether multimodal LLMs aligned to refuse harmful text requests also refuse harmful image-based requests, finding significant gaps.
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik + 4 more — NeurIPS 2024
Introduces TAP (Tree of Attacks with Pruning), which uses an attacker LLM to iteratively refine and prune candidate jailbreak prompts in a tree structure, achieving high success rates against black-box target models.
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang + 4 more — ICLR 2024
Demonstrates that LLMs can be jailbroken using cipher-based encoding, bypassing safety training designed for natural language.
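The core idea is that the request is shifted out of natural language before it reaches the model; a minimal sketch with a simple Caesar shift and a benign placeholder request (the paper evaluates several encodings, and this is not its code):

```python
# Sketch of the cipher-encoding idea: the text is moved out of natural
# language, so safety training tuned on plain English never sees the request
# in the form it was trained on. Caesar shift shown here as one example.

def caesar_encode(text: str, shift: int = 3) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# A benign placeholder request, encoded the way a cipher-style prompt would be.
print(caesar_encode("Please summarize the attached document."))
# -> "Sohdvh vxppdulch wkh dwwdfkhg grfxphqw."
```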
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
Leo Schwinn, David Dobre, Stephan Günnemann + 1 more — arXiv preprint
Systematizes adversarial attacks and defenses for LLMs, connecting them to the classical adversarial ML literature while identifying LLM-specific threats.
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety
Yi Zeng, Hongpeng Lin, Jingwen Zhang + 3 more — ACL 2024
Applies social science persuasion techniques to jailbreak LLMs, using a taxonomy of persuasion strategies to craft prompts that achieve high attack success rates.
Adaptive Attacks Break Defenses Against LLM Jailbreaking
Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint
Shows that adaptive adversaries can bypass most proposed jailbreak defenses, highlighting the arms race between attacks and defenses.
Anthropic: Many-shot Jailbreaking
Anthropic — Anthropic Research Blog
Reveals many-shot jailbreaking, a technique exploiting long context windows by including many examples of harmful Q&A pairs to override safety training.
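A minimal sketch of the prompt structure (placeholder content only; per the post, the effect comes from packing hundreds of faux dialogue turns into a long context window):

```python
# Sketch of the many-shot prompt structure: a long run of fabricated
# question/answer turns placed before the final query. Placeholder text only.

def build_many_shot_prompt(shots: list[tuple[str, str]], final_question: str) -> str:
    turns = [f"User: {q}\nAssistant: {a}" for q, a in shots]
    turns.append(f"User: {final_question}\nAssistant:")
    return "\n\n".join(turns)

# With a very long context window, an attacker can include hundreds of shots.
placeholder_shots = [("<question i>", "<compliant answer i>")] * 256
prompt = build_many_shot_prompt(placeholder_shots, "<final question>")
print(prompt.count("User:"))  # 257 in-context turns before the real query
```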
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini + 3 more — arXiv preprint
Proposes an automated method (GCG) to generate adversarial suffixes that cause aligned LLMs to produce harmful content, with attacks transferring across models including ChatGPT and Claude.
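A schematic of what a greedy coordinate search over suffix tokens looks like, under stated assumptions: the scoring function below is a random stub standing in for the model loss, and the real GCG ranks candidate substitutions using gradients with respect to one-hot token embeddings, which this sketch omits.

```python
import random

# Schematic of a GCG-style search loop: keep a suffix of adversarial tokens
# and greedily swap one position at a time toward a lower target loss.

VOCAB = [f"tok{i}" for i in range(1000)]   # placeholder vocabulary

def suffix_loss(suffix: list[str]) -> float:
    """Stub: would be the target model's loss on producing the desired output."""
    return random.random()

def gcg_search(suffix_len: int = 20, steps: int = 500, candidates: int = 64):
    suffix = [random.choice(VOCAB) for _ in range(suffix_len)]
    best = suffix_loss(suffix)
    for _ in range(steps):
        pos = random.randrange(suffix_len)             # coordinate to modify
        for tok in random.sample(VOCAB, candidates):   # candidate substitutions
            trial = suffix[:pos] + [tok] + suffix[pos + 1:]
            loss = suffix_loss(trial)
            if loss < best:                            # keep the best swap found
                best, suffix = loss, trial
    return suffix, best
```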
Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Xinyue Shen, Zeyuan Chen, Michael Backes + 2 more — CCS 2024
Collects and analyzes 6,387 jailbreak prompts from the wild, developing a comprehensive taxonomy of jailbreak techniques and evaluating their effectiveness.
LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat
Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish — arXiv preprint
Shows that LoRA fine-tuning with as few as 100 examples can remove safety guardrails from Llama 2-Chat, raising concerns about fine-tuning access to aligned models.
Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs through a Global Scale Prompt Hacking Competition
Sander Schulhoff, Jeremy Pinto, Anaum Khan + 7 more — EMNLP 2023
Presents results from a global prompt hacking competition with 600K+ adversarial prompts, revealing systemic LLM vulnerabilities across multiple models and defense strategies.
Multi-step Jailbreaking Privacy Attacks on ChatGPT
Haoran Li, Dadi Guo, Wei Fan + 4 more — EMNLP 2023 Findings
Demonstrates multi-step jailbreaking attacks to extract personal information from ChatGPT, showing how sequential prompting can bypass safety measures.