GenAI Security Literature Review
A comprehensive, community-driven, auto-updating database of GenAI and LLM security research, standards, tools, and resources.
Recent Additions
OWASP Top 10 for Large Language Model Applications
Steve Wilson, OWASP LLM AI Security Team — OWASP Foundation
The definitive OWASP guide identifying the top 10 most critical security risks in LLM applications, with descriptions, examples, and mitigation strategies.
OWASP Top 10 for Agentic AI Applications
OWASP Foundation — OWASP Foundation
Identifies the top 10 security risks specific to agentic AI applications including excessive agency, unsafe tool execution, and inadequate oversight.
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt — NeurIPS 2023
Analyzes failure modes of LLM safety training, identifying two broad categories: competing objectives and mismatched generalization, demonstrating attacks that exploit each.
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Boxin Wang, Weixin Chen, Hengzhi Pei + 7 more — NeurIPS 2023
Comprehensive trustworthiness evaluation of GPT models across 8 dimensions including toxicity, bias, robustness, privacy, fairness, and machine ethics.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson Denison, Jesse Mu + 27 more — arXiv preprint
Demonstrates that LLMs can be trained with deceptive behaviors (sleeper agents) that persist through standard safety training including RLHF, posing risks for backdoor attacks.
Most Cited
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu + 4 more — ICLR 2023
Foundational work on the ReAct paradigm for LLM agents that interleave reasoning and tool-use actions, enabling complex task completion with security implications.
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessi + 6 more — NeurIPS 2023
Demonstrates how LLMs can learn to use external tools (APIs, search engines, calculators) through self-supervised learning, foundational for understanding tool-use security.
Extracting Training Data from Large Language Models
Nicholas Carlini, Florian Tramer, Eric Wallace + 9 more — USENIX Security 2021
Demonstrates that large language models memorize and can be prompted to emit verbatim training data, including PII, revealing significant privacy risks.
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu + 44 more — arXiv preprint
Introduces Constitutional AI (CAI), a method for training AI systems to be harmless using a set of principles (a constitution) and AI-generated feedback, reducing reliance on human red teamers.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini + 3 more — arXiv preprint
Proposes an automated method (GCG) to generate adversarial suffixes that cause aligned LLMs to produce harmful content, with attacks transferring across models including ChatGPT and Claude.