Membership Inference
6 resources · Attacks & Threats
Determining whether specific data was used in training
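As a point of reference for the entries below, here is a minimal, hedged sketch of the simplest membership signal: score a candidate text by the average token loss a model assigns it, on the assumption that training members tend to receive lower loss. The model name, library choice, and example text are placeholder assumptions for illustration, not taken from any of the papers listed here.

```python
# Minimal loss-based membership-inference sketch (illustrative only).
# Assumes Hugging Face transformers and a small causal LM; the model
# name and the example text are arbitrary placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder target model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def membership_score(text: str) -> float:
    """Average per-token cross-entropy; lower suggests the text is more likely a training member."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

candidate = "The quick brown fox jumps over the lazy dog."
print(f"avg token loss: {membership_score(candidate):.3f}")
# A real attack calibrates this score against losses on known non-members
# (or against a reference model) before thresholding.
```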
Prompt Stealing Attacks Against Text-to-Image Generation Models
Xinyue Shen, Yiting Qu, Michael Backes + 1 more — USENIX Security 2024
Demonstrates attacks that steal the prompts used to generate images from text-to-image models, raising IP and privacy concerns.
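To make the threat concrete, a hedged sketch of the core intuition: given only a generated image, an attacker can approximate the prompt's subject with an off-the-shelf captioning model. This is only an illustration, not the paper's attack; the caption model and image path below are assumed placeholders.

```python
# Illustrative sketch only: approximate the subject of an unknown prompt by
# captioning the generated image with an off-the-shelf model. This is not the
# paper's attack; the caption model and image path are assumed placeholders.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("generated_image.png")[0]["generated_text"]  # placeholder file
print("approximate prompt subject:", caption)
# A full attack would also recover style modifiers and other prompt components,
# which a plain caption does not capture.
```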
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil, Peter Hase, Mohit Bansal — ICLR 2024
Evaluates methods for deleting sensitive information from trained LLMs, finding current unlearning approaches insufficient against determined adversaries.
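One way such an extraction-style evaluation can look in practice, as a hedged sketch rather than the paper's exact protocol: query the edited model with several paraphrases and test whether the supposedly deleted answer still surfaces among its top-k next-token candidates. The model name, prompts, and k below are placeholders.

```python
# Hedged sketch of an extraction-style check on a "deleted" fact: query the
# edited model with several paraphrases and test whether the removed answer
# still appears among its top-k next-token candidates. Model name, prompts,
# and k are placeholders, not the paper's exact protocol.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for the edited / unlearned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def topk_next_tokens(prompt: str, k: int = 10) -> list[str]:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return [tok.decode(int(i)).strip() for i in logits.topk(k).indices]

deleted_answer = "Paris"  # stand-in for a fact that was supposedly removed
paraphrases = [
    "The capital of France is",
    "France's capital city is called",
    "Q: What is the capital of France? A:",
]
leaked = any(deleted_answer in topk_next_tokens(p) for p in paraphrases)
print("deleted fact still extractable:", leaked)
```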
Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models
Jeffrey Cheng, Ruoxi Jia — arXiv preprint
Develops precise methods for detecting and extracting training data from LLMs when white-box access is available, with implications for copyright and privacy.
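To make the white-box setting concrete, a hedged sketch of the kind of per-example signals such an attack can exploit: with full parameter access, an attacker can pair an example's loss with the gradient norm it induces and feed features like these to an attack classifier. The model name is a placeholder, and this illustrates the general idea rather than the paper's pipeline.

```python
# Hedged sketch of white-box membership features: with parameter access, pair
# an example's loss with the gradient norm it induces; a supervised attack
# classifier can be trained on features like these. Model name is a placeholder
# and this illustrates the general idea, not the paper's pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder white-box target
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def whitebox_features(text: str) -> tuple[float, float]:
    ids = tok(text, return_tensors="pt").input_ids
    model.zero_grad()
    loss = model(ids, labels=ids).loss
    loss.backward()
    grad_sq = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
    return loss.item(), torch.sqrt(grad_sq).item()

print(whitebox_features("Call me Ishmael."))
# Training members tend to show lower loss and smaller gradients; a real attack
# feeds many such feature vectors to a trained membership classifier.
```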
Scalable Extraction of Training Data from (Production) Language Models
Milad Nasr, Nicholas Carlini, Jonathan Hayase + 7 more — arXiv preprint
Develops scalable attacks that extract gigabytes of training data from open, semi-open, and closed models, including thousands of verbatim training examples recovered from ChatGPT for roughly $200 in queries.
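A hedged sketch of the verbatim-memorization check underlying such attacks: a generation counts as extracted training data if it shares a sufficiently long token span with a known reference corpus. The paper builds a suffix array over terabytes of web text; the naive substring scan below is only an illustration, with a span length of roughly 50 tokens in the spirit of the paper's verbatim-match criterion.

```python
# Hedged sketch of a verbatim-memorization check: a generation is flagged if it
# shares a long token span with a known reference corpus. The paper uses a
# suffix array over terabytes of web text; this naive whitespace-token scan is
# only an illustration, with a ~50-token span in the spirit of the paper's
# verbatim-match criterion.
def contains_verbatim_span(generation: str, corpus: list[str], span_tokens: int = 50) -> bool:
    tokens = generation.split()
    for start in range(len(tokens) - span_tokens + 1):
        span = " ".join(tokens[start:start + span_tokens])
        if any(span in doc for doc in corpus):
            return True
    return False

corpus = ["placeholder: known pretraining documents would go here"]
sample = "placeholder: model output sampled after a divergence-inducing prompt"
print("verbatim training data found:", contains_verbatim_span(sample, corpus))
```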
Multi-step Jailbreaking Privacy Attacks on ChatGPT
Haoran Li, Dadi Guo, Wei Fan + 4 more — EMNLP 2023 Findings
Demonstrates multi-step jailbreaking attacks to extract personal information from ChatGPT, showing how sequential prompting can bypass safety measures.
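For defenders studying this class of attack, the multi-step structure can be represented as an ordinary chat transcript: a role-play turn, a fabricated assistant acknowledgement, and only then the privacy-sensitive query. The strings below are neutral placeholders, not the prompts used in the paper.

```python
# Structural illustration only: the attack unfolds over multiple chat turns,
# with a role-play turn, a fabricated assistant acknowledgement, and then the
# privacy-sensitive query. All strings are neutral placeholders, not the
# prompts used in the paper.
messages = [
    {"role": "user", "content": "<role-play setup goes here>"},
    {"role": "assistant", "content": "<assumed acknowledgement of the role-play>"},
    {"role": "user", "content": "<request for the target's personal information>"},
]
# The key observation is that splitting the request across turns, with the
# assistant appearing to have already accepted the role-play, can bypass
# refusals that a single-turn prompt would trigger.
print(messages)
```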
Extracting Training Data from Large Language Models
Nicholas Carlini, Florian Tramèr, Eric Wallace + 9 more — USENIX Security 2021
Demonstrates that large language models memorize and can be prompted to emit verbatim training data, including PII, revealing significant privacy risks.
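Two of the paper's memorization signals are easy to reproduce in a few lines: the model's perplexity on a generated sample, and the sample's zlib compression entropy as a model-free baseline; samples with unusually low perplexity relative to their zlib entropy are flagged as memorization candidates. The model name below is a placeholder for the attacked model.

```python
# Sketch of two memorization signals from the paper: model perplexity on a
# sample, and the sample's zlib compression entropy as a model-free baseline.
# Low perplexity relative to zlib entropy marks a memorization candidate.
# The model name is a placeholder for the attacked model.
import zlib

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper attacks GPT-2 variants
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def zlib_entropy(text: str) -> int:
    """Bits needed to zlib-compress the text."""
    return len(zlib.compress(text.encode("utf-8"))) * 8

sample = "placeholder: one of many samples generated from the model"
print(f"perplexity={perplexity(sample):.2f}  zlib entropy={zlib_entropy(sample)} bits")
# The paper ranks large batches of generated samples by such scores and
# manually inspects the top candidates for verbatim training data.
```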