Adversarial Examples
11 resources · Attacks & Threats
Evasion attacks and adversarial perturbations
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson Denison, Jesse Mu + 27 more — arXiv preprint
Demonstrates that LLMs can be trained with deceptive backdoor behaviors ("sleeper agents") that persist through standard safety training such as RLHF and adversarial training, posing a lasting backdoor risk.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen + 1 more — ICLR 2024
Proposes AutoDAN, a method for automatically generating stealthy jailbreak prompts that are semantically meaningful and can bypass perplexity-based defenses.
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda + 3 more — AAAI 2024
Shows adversarial images can jailbreak multimodal LLMs that are robust to text-only attacks, bypassing alignment through the visual channel.
Are Aligned Neural Networks Adversarially Aligned?
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo + 8 more — NeurIPS 2023
Evaluates whether aligned LLMs remain aligned under adversarial inputs, finding that current text-based attacks struggle against aligned chat models while adversarial images reliably elicit harmful outputs from multimodal models.
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
Leo Schwinn, David Dobre, Stephan Günnemann + 1 more — arXiv preprint
Systematizes adversarial attacks and defenses for LLMs, connecting them to the classical adversarial ML literature while identifying LLM-specific threats.
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Yangsibo Huang, Samyak Gupta, Mengzhou Xia + 2 more — ICLR 2024
Demonstrates that safety alignment in LLMs is brittle and can be undermined through simple weight pruning or low-rank modifications without any fine-tuning data.
TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models
Jiaqi Xue, Mengxin Zheng, Ting Hua + 4 more — NeurIPS 2023
Proposes TrojLLM, a black-box attack that generates universal trojan prompts to compromise LLMs without access to model internals.
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Zhen Xiang, Fengqing Jiang, Zidi Xiong + 3 more — NeurIPS 2024
Demonstrates BadChain, a backdoor attack on chain-of-thought prompting in which poisoned demonstrations insert a malicious reasoning step, causing queries that contain the trigger to produce attacker-chosen outputs.
Adaptive Attacks Break Defenses Against LLM Jailbreaking
Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint
Shows that adaptive adversaries can bypass most proposed jailbreak defenses, highlighting the arms race between attacks and defenses.
Adversarial Attacks on Multimodal Agents
Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov + 2 more — arXiv preprint
Demonstrates adversarial attacks on multimodal agents that take actions in digital environments, showing visual perturbations can hijack agent behavior.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini + 3 more — arXiv preprint
Proposes Greedy Coordinate Gradient (GCG), an automated method for generating adversarial suffixes that cause aligned LLMs to produce harmful content, with attacks transferring across models including ChatGPT and Claude.
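To make the suffix-optimization idea behind GCG concrete, here is a minimal, simplified sketch of a coordinate-gradient loop in PyTorch. It is not the authors' implementation: it uses GPT-2 purely as a local stand-in model, a benign placeholder prompt and target string, and a single-swap update per step instead of the batched top-k candidate search described in the paper; all hyperparameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is only a stand-in so the loop runs locally; GCG targets aligned chat models.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():          # we only need gradients w.r.t. the suffix tokens
    p.requires_grad_(False)

prompt = "Write the first line of a story."   # placeholder request
target = " Sure, here is"                     # placeholder target prefix to optimize toward
prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]

suffix_len = 8
suffix_ids = torch.full((suffix_len,), tok.encode("!")[0])  # start from a trivial suffix

embed = model.get_input_embeddings()
vocab_size = embed.weight.shape[0]

for step in range(100):
    # Represent the suffix as one-hot vectors so the loss is differentiable w.r.t. token choice.
    one_hot = torch.zeros(suffix_len, vocab_size)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    suffix_emb = one_hot @ embed.weight
    inputs_emb = torch.cat([embed(prompt_ids), suffix_emb, embed(target_ids)]).unsqueeze(0)
    logits = model(inputs_embeds=inputs_emb).logits[0]

    # Cross-entropy of the target tokens given prompt + suffix (teacher forcing).
    start = prompt_ids.numel() + suffix_len
    loss = torch.nn.functional.cross_entropy(
        logits[start - 1 : start - 1 + target_ids.numel()], target_ids
    )
    loss.backward()

    # Simplified coordinate step: swap the single (position, token) pair whose gradient
    # suggests the largest loss decrease. The paper instead samples a batch of top-k
    # candidate swaps, evaluates each, and keeps the best.
    with torch.no_grad():
        flat = one_hot.grad.argmin()
        pos, new_tok = flat // vocab_size, flat % vocab_size
        suffix_ids[pos] = new_tok

    if step % 10 == 0:
        print(f"step {step:3d}  loss {loss.item():.3f}  suffix {tok.decode(suffix_ids)}")
```

The one-hot relaxation is what makes the discrete token search gradient-guided: the gradient with respect to each one-hot entry scores every possible token substitution at every suffix position in a single backward pass, and transferability in the paper comes from optimizing such suffixes against several open models at once.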