Adversarial Examples
11 resources · Attacks & Threats
Evasion attacks and adversarial perturbations
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson Denison, Jesse Mu + 27 more — arXiv preprint
Demonstrates that LLMs can be trained with deceptive backdoor behaviors ("sleeper agents") that persist through standard safety training such as RLHF and adversarial training, posing a lasting backdoor risk.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen + 1 more — ICLR 2024
Proposes AutoDAN, a method for automatically generating stealthy jailbreak prompts that are semantically meaningful and can bypass perplexity-based defenses.
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda + 3 more — AAAI 2024
Shows adversarial images can jailbreak multimodal LLMs that are robust to text-only attacks, bypassing alignment through the visual channel.
Are Aligned Neural Networks Adversarially Aligned?
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo + 8 more — NeurIPS 2023
Evaluates whether aligned LLMs remain aligned under adversarial inputs, finding that current text-based attacks struggle against aligned chat models while adversarial images reliably elicit harmful outputs from multimodal models.
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
Leo Schwinn, David Dobre, Stephan Günnemann + 1 more — arXiv preprint
Systematizes adversarial attacks and defenses for LLMs, connecting them to the classical adversarial ML literature while identifying LLM-specific threats.
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Yangsibo Huang, Samyak Gupta, Mengzhou Xia + 2 more — ICLR 2024
Demonstrates that safety alignment in LLMs is brittle and can be undermined through simple weight pruning or low-rank modifications without any fine-tuning data.
TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models
Jiaqi Xue, Mengxin Zheng, Ting Hua + 4 more — NeurIPS 2023
Proposes TrojLLM, a black-box attack that generates universal trojan prompts to compromise LLMs without access to model internals.
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Zhen Xiang, Fengqing Jiang, Zidi Xiong + 3 more — NeurIPS 2024
Demonstrates BadChain, a backdoor attack on chain-of-thought prompting in which poisoned demonstrations insert a malicious reasoning step, causing queries that contain the trigger to produce attacker-chosen outputs.
Adaptive Attacks Break Defenses Against LLM Jailbreaking
Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint
Shows that adaptive adversaries can bypass most proposed jailbreak defenses, highlighting the arms race between attacks and defenses.
Adversarial Attacks on Multimodal Agents
Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov + 2 more — arXiv preprint
Demonstrates adversarial attacks on multimodal agents that take actions in digital environments, showing visual perturbations can hijack agent behavior.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini + 3 more — arXiv preprint
Proposes Greedy Coordinate Gradient (GCG), an automated method for generating adversarial suffixes that cause aligned LLMs to produce harmful content, with attacks transferring across models including ChatGPT and Claude.
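To make the suffix-optimization idea behind GCG concrete, here is a minimal, simplified sketch of a coordinate-gradient loop in PyTorch. It is not the authors' implementation: it uses GPT-2 purely as a local stand-in model, a benign placeholder prompt and target string, and a single-swap update per step instead of the batched top-k candidate search described in the paper; all hyperparameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is only a stand-in so the loop runs locally; GCG targets aligned chat models.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():          # we only need gradients w.r.t. the suffix tokens
    p.requires_grad_(False)

prompt = "Write the first line of a story."   # placeholder request
target = " Sure, here is"                     # placeholder target prefix to optimize toward
prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]

suffix_len = 8
suffix_ids = torch.full((suffix_len,), tok.encode("!")[0])  # start from a trivial suffix

embed = model.get_input_embeddings()
vocab_size = embed.weight.shape[0]

for step in range(100):
    # Represent the suffix as one-hot vectors so the loss is differentiable w.r.t. token choice.
    one_hot = torch.zeros(suffix_len, vocab_size)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    suffix_emb = one_hot @ embed.weight
    inputs_emb = torch.cat([embed(prompt_ids), suffix_emb, embed(target_ids)]).unsqueeze(0)
    logits = model(inputs_embeds=inputs_emb).logits[0]

    # Cross-entropy of the target tokens given prompt + suffix (teacher forcing).
    start = prompt_ids.numel() + suffix_len
    loss = torch.nn.functional.cross_entropy(
        logits[start - 1 : start - 1 + target_ids.numel()], target_ids
    )
    loss.backward()

    # Simplified coordinate step: swap the single (position, token) pair whose gradient
    # suggests the largest loss decrease. The paper instead samples a batch of top-k
    # candidate swaps, evaluates each, and keeps the best.
    with torch.no_grad():
        flat = one_hot.grad.argmin()
        pos, new_tok = flat // vocab_size, flat % vocab_size
        suffix_ids[pos] = new_tok

    if step % 10 == 0:
        print(f"step {step:3d}  loss {loss.item():.3f}  suffix {tok.decode(suffix_ids)}")
```

The one-hot relaxation is what makes the discrete token search gradient-guided: the gradient with respect to each one-hot entry scores every possible token substitution at every suffix position in a single backward pass, and transferability in the paper comes from optimizing such suffixes against several open models at once.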