← Back to search
paper reviewed open access llmsec-2023-00003

Poisoning Language Models During Instruction Tuning

Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein

2023-05 — ICML 2023 210 citations

Abstract

Shows that adversaries can insert poisoned examples into instruction-tuning datasets, causing models to generate targeted outputs for attacker-chosen triggers.

Categories

Tags

instruction-tuningbackdoorfine-tuning

Framework Mappings

OWASP LLM: LLM04 MITRE ATLAS: AML.T0020

Cite This Resource

@article{llmsec202300003,
  title = {Poisoning Language Models During Instruction Tuning},
  author = {Alexander Wan and Eric Wallace and Sheng Shen and Dan Klein},
  year = {2023},
  journal = {ICML 2023},
  url = {https://arxiv.org/abs/2305.00944},
}

Metadata

Added
2026-04-14
Added by
manual
Source
manual
arxiv_id
2305.00944