Data Poisoning

6 resources

Attacks & Threats

Training data, fine-tuning, and RAG poisoning attacks

paper reviewed open access 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu + 27 more — arXiv preprint

Demonstrates that LLMs can be trained with deceptive behaviors (sleeper agents) that persist through standard safety training including RLHF, posing risks for backdoor attacks.

data poisoning guardrails adversarial examples 420 citations

paper reviewed open access 2024

Poisoning Web-Scale Training Datasets is Practical

Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo + 6 more — IEEE S&P 2024

Demonstrates practical attacks to poison web-scale datasets like LAION by purchasing expired domains, affecting 0.01% of a dataset for under $60.

data poisoning supply chain attacks 250 citations

paper reviewed open access 2024

PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models

Wei Zou, Runpeng Geng, Binghui Wang + 1 more — arXiv preprint

Demonstrates knowledge poisoning attacks against RAG systems where adversaries inject malicious texts into the knowledge database to manipulate LLM outputs.

data poisoning rag security 85 citations

paper reviewed open access 2024

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models

Jiaqi Xue, Mengxin Zheng, Ting Hua + 4 more — NeurIPS 2023

Proposes TrojLLM, a black-box attack that generates universal trojan prompts to compromise LLMs without access to model internals.

data poisoning adversarial examples supply chain attacks 85 citations

paper reviewed open access 2024

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang, Fengqing Jiang, Zidi Xiong + 3 more — NeurIPS 2024

Demonstrates backdoor attacks on chain-of-thought reasoning in LLMs where poisoned demonstrations cause incorrect reasoning chains.

data poisoning adversarial examples 55 citations

paper reviewed open access 2023

Poisoning Language Models During Instruction Tuning

Alexander Wan, Eric Wallace, Sheng Shen + 1 more — ICML 2023

Shows that adversaries can insert poisoned examples into instruction-tuning datasets, causing models to generate targeted outputs for attacker-chosen triggers.

data poisoning fine tuning security 210 citations