← Back to all categories

Data Poisoning

6 resources

Attacks & Threats

Training data, fine-tuning, and RAG poisoning attacks

paper reviewed open access 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu + 27 more — arXiv preprint

Demonstrates that LLMs can be trained with deceptive behaviors (sleeper agents) that persist through standard safety training including RLHF, posing risks for backdoor attacks.

paper reviewed open access 2024

Poisoning Web-Scale Training Datasets is Practical

Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo + 6 more — IEEE S&P 2024

Demonstrates practical attacks to poison web-scale datasets like LAION by purchasing expired domains, affecting 0.01% of a dataset for under $60.

paper reviewed open access 2024

PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models

Wei Zou, Runpeng Geng, Binghui Wang + 1 more — arXiv preprint

Demonstrates knowledge poisoning attacks against RAG systems where adversaries inject malicious texts into the knowledge database to manipulate LLM outputs.

paper reviewed open access 2024

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models

Jiaqi Xue, Mengxin Zheng, Ting Hua + 4 more — NeurIPS 2023

Proposes TrojLLM, a black-box attack that generates universal trojan prompts to compromise LLMs without access to model internals.

paper reviewed open access 2024

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang, Fengqing Jiang, Zidi Xiong + 3 more — NeurIPS 2024

Demonstrates backdoor attacks on chain-of-thought reasoning in LLMs where poisoned demonstrations cause incorrect reasoning chains.

paper reviewed open access 2023

Poisoning Language Models During Instruction Tuning

Alexander Wan, Eric Wallace, Sheng Shen + 1 more — ICML 2023

Shows that adversaries can insert poisoned examples into instruction-tuning datasets, causing models to generate targeted outputs for attacker-chosen triggers.