Data Poisoning
6 resourcesAttacks & Threats
Training data, fine-tuning, and RAG poisoning attacks
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson Denison, Jesse Mu + 27 more — arXiv preprint
Demonstrates that LLMs can be trained with deceptive behaviors (sleeper agents) that persist through standard safety training including RLHF, posing risks for backdoor attacks.
Poisoning Web-Scale Training Datasets is Practical
Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo + 6 more — IEEE S&P 2024
Demonstrates practical attacks to poison web-scale datasets like LAION by purchasing expired domains, affecting 0.01% of a dataset for under $60.
PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
Wei Zou, Runpeng Geng, Binghui Wang + 1 more — arXiv preprint
Demonstrates knowledge poisoning attacks against RAG systems where adversaries inject malicious texts into the knowledge database to manipulate LLM outputs.
TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models
Jiaqi Xue, Mengxin Zheng, Ting Hua + 4 more — NeurIPS 2023
Proposes TrojLLM, a black-box attack that generates universal trojan prompts to compromise LLMs without access to model internals.
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Zhen Xiang, Fengqing Jiang, Zidi Xiong + 3 more — NeurIPS 2024
Demonstrates backdoor attacks on chain-of-thought reasoning in LLMs where poisoned demonstrations cause incorrect reasoning chains.
Poisoning Language Models During Instruction Tuning
Alexander Wan, Eric Wallace, Sheng Shen + 1 more — ICML 2023
Shows that adversaries can insert poisoned examples into instruction-tuning datasets, causing models to generate targeted outputs for attacker-chosen triggers.