← Back to all categories

Jailbreaking

17 resources

Attacks & Threats

Guardrail bypass and alignment subversion techniques

paper reviewed open access 2024

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, Jacob Steinhardt — NeurIPS 2023

Analyzes failure modes of LLM safety training, identifying two broad categories: competing objectives and mismatched generalization, demonstrating attacks that exploit each.

paper reviewed open access 2024

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen + 1 more — ICLR 2024

Proposes AutoDAN, a method for automatically generating stealthy jailbreak prompts that are semantically meaningful and can bypass perplexity-based defenses.

paper reviewed open access 2024

PAIR: Prompt Automatic Iterative Refinement for Jailbreaking LLMs

Patrick Chao, Alexander Robey, Edgar Dobriban + 3 more — NeurIPS 2024

Uses an attacker LLM to automatically generate jailbreak prompts through iterative refinement achieving high success with only black-box access.

paper reviewed open access 2024

Visual Adversarial Examples Jailbreak Aligned Large Language Models

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda + 3 more — AAAI 2024

Shows adversarial images can jailbreak multimodal LLMs that are robust to text-only attacks, bypassing alignment through the visual channel.

dataset reviewed open access 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin + 9 more — ICML 2024

Introduces HarmBench, a standardized framework for evaluating automated red teaming methods and robust refusal in LLMs with a comprehensive behavior taxonomy.

paper reviewed open access 2024

Are Aligned Neural Networks Adversarially Aligned?

Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo + 8 more — NeurIPS 2023

Evaluates whether multimodal LLMs aligned to refuse harmful text requests also refuse harmful image-based requests, finding significant gaps.

paper reviewed open access 2024

Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subtrees

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik + 4 more — NeurIPS 2024

Introduces TAP using an LLM to iteratively refine jailbreak prompts against black-box target models with high success rates.

paper reviewed open access 2024

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang + 4 more — ICLR 2024

Demonstrates that LLMs can be jailbroken using cipher-based encoding, bypassing safety training designed for natural language.

jailbreaking 160 citations
paper reviewed open access 2024

Adversarial Attacks and Defenses in Large Language Models: Old and New Threats

Leo Schwinn, David Dobre, Stephan Gunnemann + 1 more — arXiv preprint

Systematizes adversarial attacks and defenses for LLMs, connecting them to the classical adversarial ML literature while identifying LLM-specific threats.

paper reviewed open access 2024

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety

Yi Zeng, Hongpeng Lin, Jingwen Zhang + 3 more — ACL 2024

Applies social science persuasion techniques to jailbreak LLMs, showing high attack success rates using persuasion taxonomy.

paper reviewed open access 2024

Adaptive Attacks Break Defenses Against LLM Jailbreaking

Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint

Shows that adaptive adversaries can bypass most proposed jailbreak defenses, highlighting the arms race between attacks and defenses.

report reviewed open access 2024

Anthropic: Many-shot Jailbreaking

Anthropic — Anthropic Research Blog

Reveals many-shot jailbreaking, a technique exploiting long context windows by including many examples of harmful Q&A pairs to override safety training.

paper reviewed open access 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini + 3 more — arXiv preprint

Proposes an automated method (GCG) to generate adversarial suffixes that cause aligned LLMs to produce harmful content, with attacks transferring across models including ChatGPT and Claude.

paper reviewed open access 2023

Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes + 2 more — CCS 2024

Collects and analyzes 6,387 jailbreak prompts from the wild, developing a comprehensive taxonomy of jailbreak techniques and evaluating their effectiveness.

paper reviewed open access 2023

LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat

Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish — arXiv preprint

Shows that LoRA fine-tuning with as few as 100 examples can remove safety guardrails from Llama 2-Chat, raising concerns about fine-tuning access to aligned models.

paper reviewed open access 2023

Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs through a Global Scale Prompt Hacking Competition

Sander Schulhoff, Jeremy Pinto, Anaum Khan + 7 more — EMNLP 2023

Presents results from a global prompt hacking competition with 600K+ adversarial prompts, revealing systemic LLM vulnerabilities across multiple models and defense strategies.

paper reviewed open access 2023

Multi-step Jailbreaking Privacy Attacks on ChatGPT

Haoran Li, Dadi Guo, Wei Fan + 4 more — EMNLP 2023 Findings

Demonstrates multi-step jailbreaking attacks to extract personal information from ChatGPT, showing how sequential prompting can bypass safety measures.