Search Resources

100 results

standardreviewed2025

OWASP Top 10 for Large Language Model Applications

Steve Wilson, OWASP LLM AI Security Team — OWASP Foundation

The definitive OWASP guide identifying the top 10 most critical security risks in LLM applications, with descriptions, examples, and mitigation strategies.

threat modelingrisk frameworks

standardreviewed2025

OWASP Top 10 for Agentic AI Applications

OWASP Foundation — OWASP Foundation

Identifies the top 10 security risks specific to agentic AI applications including excessive agency, unsafe tool execution, and inadequate oversight.

threat modelingrisk frameworksagent architecture

paperreviewed2024

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, Jacob Steinhardt — NeurIPS 2023

Analyzes failure modes of LLM safety training, identifying two broad categories: competing objectives and mismatched generalization, demonstrating attacks that exploit each.

jailbreakingguardrails520 citations

paperreviewed2024

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Boxin Wang, Weixin Chen, Hengzhi Pei + 7 more — NeurIPS 2023

Comprehensive trustworthiness evaluation of GPT models across 8 dimensions including toxicity, bias, robustness, privacy, fairness, and machine ethics.

benchmarksresponsible aisurvey520 citations

paperreviewed2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu + 27 more — arXiv preprint

Demonstrates that LLMs can be trained with deceptive behaviors (sleeper agents) that persist through standard safety training including RLHF, posing risks for backdoor attacks.

data poisoningguardrailsadversarial examples420 citations

paperreviewed2024

A Survey on Large Language Model (LLM) Security and Privacy: The Good, The Bad, and The Ugly

Yifan Yao, Jinhao Duan, Kaidi Xu + 3 more — High-Confidence Computing

Comprehensive survey covering LLM security and privacy from three perspectives: beneficial applications of LLMs for security, attacks against LLMs, and defensive techniques.

survey350 citations

paperreviewed2024

TrustLLM: Trustworthiness in Large Language Models

Lichao Sun, Yue Huang, Haoran Wang + 2 more — ICML 2024

Comprehensive study of LLM trustworthiness across truthfulness, safety, fairness, robustness, privacy, and machine ethics with benchmarks.

benchmarksresponsible aisurvey300 citations

paperreviewed2024

Prompt Injection Attack Against LLM-Integrated Applications

Yi Liu, Gelei Deng, Yuekang Li + 6 more — ACM Computing Surveys

First comprehensive survey of prompt injection attacks against LLM-integrated applications, categorizing attacks and defenses with a unified framework.

prompt injectionsurvey280 citations

paperreviewed2024

Poisoning Web-Scale Training Datasets is Practical

Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo + 6 more — IEEE S&P 2024

Demonstrates practical attacks to poison web-scale datasets like LAION by purchasing expired domains, affecting 0.01% of a dataset for under $60.

data poisoningsupply chain attacks250 citations

paperreviewed2024

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen + 1 more — ICLR 2024

Proposes AutoDAN, a method for automatically generating stealthy jailbreak prompts that are semantically meaningful and can bypass perplexity-based defenses.

jailbreakingadversarial examplesred teaming230 citations

paperreviewed2024

PAIR: Prompt Automatic Iterative Refinement for Jailbreaking LLMs

Patrick Chao, Alexander Robey, Edgar Dobriban + 3 more — NeurIPS 2024

Uses an attacker LLM to automatically generate jailbreak prompts through iterative refinement achieving high success with only black-box access.

jailbreakingred teaming220 citations

paperreviewed2024

Visual Adversarial Examples Jailbreak Aligned Large Language Models

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda + 3 more — AAAI 2024

Shows adversarial images can jailbreak multimodal LLMs that are robust to text-only attacks, bypassing alignment through the visual channel.

jailbreakingadversarial examples220 citations

paperreviewed2024

LLM Agents Can Autonomously Hack Websites

Richard Fang, Rohan Bindu, Akul Gupta + 2 more — arXiv preprint

Demonstrates that LLM agents can autonomously perform web hacking tasks including SQL injection, XSS, and CSRF attacks without human guidance.

agentic threatsautonomous operationssocial engineering200 citations

paperreviewed2024

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E. Jimenez, Alexander Wettig + 4 more — NeurIPS 2024

Demonstrates autonomous coding agents that interact with computer interfaces to solve software engineering tasks, raising questions about agent containment.

autonomous operationsagent architecturetool use security200 citations

datasetreviewed2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin + 9 more — ICML 2024

Introduces HarmBench, a standardized framework for evaluating automated red teaming methods and robust refusal in LLMs with a comprehensive behavior taxonomy.

red teamingbenchmarksjailbreaking180 citations

paperreviewed2024

On the Societal Impact of Open Foundation Models

Sayash Kapoor, Rishi Bommasani, Kevin Klyman + 2 more — arXiv preprint

Analyzes the societal impacts of open-weight foundation models, including security implications of open vs closed model access.

responsible aimodel governanceindustry report180 citations

paperreviewed2024

Are Aligned Neural Networks Adversarially Aligned?

Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo + 8 more — NeurIPS 2023

Evaluates whether multimodal LLMs aligned to refuse harmful text requests also refuse harmful image-based requests, finding significant gaps.

adversarial examplesguardrailsjailbreaking180 citations

paperreviewed2024

Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subtrees

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik + 4 more — NeurIPS 2024

Introduces TAP using an LLM to iteratively refine jailbreak prompts against black-box target models with high success rates.

jailbreakingred teaming175 citations

paperreviewed2024

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang + 4 more — ICLR 2024

Demonstrates that LLMs can be jailbroken using cipher-based encoding, bypassing safety training designed for natural language.

jailbreaking160 citations

paperreviewed2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike + 3 more — arXiv preprint

Proposes an instruction hierarchy for training LLMs to prioritize system prompts over user prompts over third-party content, as a defense against prompt injection.

prompt injectioninput filteringguardrails150 citations

paperreviewed2024

Security of AI-Based Code Generation Tools: A Multi-Perspective Study

Xinyi Hou, Yanjie Zhao, Yue Liu + 7 more — IEEE TSE

Examines security implications of AI code generation tools, analyzing vulnerability introduction patterns and mitigation strategies.

supply chain attacksmlops security150 citations

paperreviewed2024

LLM Agents Can Autonomously Exploit One-day Vulnerabilities

Richard Fang, Rohan Bindu, Akul Gupta + 1 more — arXiv preprint

Shows that LLM agents (GPT-4) can autonomously exploit real-world one-day vulnerabilities given CVE descriptions, achieving 87% success rate.

agentic threatsautonomous operationsvulnerability disclosure150 citations

paperreviewed2024

Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models

Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis + 2 more — arXiv preprint

Introduces CyberSecEval, a benchmark for evaluating the cybersecurity risks of LLM code generation, including insecure code suggestions.

benchmarkssupply chain attacksred teaming140 citations

paperreviewed2024

Adversarial Attacks and Defenses in Large Language Models: Old and New Threats

Leo Schwinn, David Dobre, Stephan Gunnemann + 1 more — arXiv preprint

Systematizes adversarial attacks and defenses for LLMs, connecting them to the classical adversarial ML literature while identifying LLM-specific threats.

surveyadversarial examplesjailbreaking125 citations

paperreviewed2024

Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory

Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou + 4 more — ICLR 2024

Evaluates LLM privacy behavior through the lens of contextual integrity theory, finding significant mismatches between LLM norms and human privacy expectations.

differential privacydata anonymization110 citations

paperreviewed2024

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Yangsibo Huang, Samyak Gupta, Mengzhou Xia + 2 more — ICLR 2024

Demonstrates that safety alignment in LLMs is brittle and can be undermined through simple weight pruning or low-rank modifications without any fine-tuning data.

guardrailsadversarial examples110 citations

paperreviewed2024

Prompt Stealing Attacks Against Text-to-Image Generation Models

Xinyue Shen, Yiting Qu, Michael Backes + 1 more — USENIX Security 2024

Demonstrates attacks that steal the prompts used to generate images from text-to-image models, raising IP and privacy concerns.

model extractionmembership inference110 citations

paperreviewed2024

Stealing Part of a Production Language Model

Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham + 10 more — ICML 2024

Demonstrates that it is possible to steal the embedding projection layer of production LLMs like OpenAI's models through the API, confirming model extraction risks.

model extraction95 citations

paperreviewed2024

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

Sam Toyer, Olivia Watkins, Ethan Adrian Mendes + 9 more — ICLR 2024

Uses data from an online game (Tensor Trust) where players compete to craft prompt injections and defenses, creating a large dataset of human-generated attacks.

prompt injectionbenchmarks95 citations

paperreviewed2024

A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models

Aysan Esmradi, Daniel Wankit Yip, Chun Fai Chan — arXiv preprint

Surveys attack techniques across the LLM lifecycle including training, fine-tuning, and inference, with comprehensive mitigation strategies.

surveythreat modeling90 citations

datasetreviewed2024

SafetyBench: Evaluating the Safety of Large Language Models

Zhexin Zhang, Leqi Lei, Lindong Wu + 2 more — ACL 2024

Large-scale safety evaluation benchmark with 11,435 multiple-choice questions across 7 safety categories in both Chinese and English.

benchmarksguardrails90 citations

paperreviewed2024

PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models

Wei Zou, Runpeng Geng, Binghui Wang + 1 more — arXiv preprint

Demonstrates knowledge poisoning attacks against RAG systems where adversaries inject malicious texts into the knowledge database to manipulate LLM outputs.

data poisoningrag security85 citations

paperreviewed2024

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models

Jiaqi Xue, Mengxin Zheng, Ting Hua + 4 more — NeurIPS 2023

Proposes TrojLLM, a black-box attack that generates universal trojan prompts to compromise LLMs without access to model internals.

data poisoningadversarial examplessupply chain attacks85 citations

paperreviewed2024

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety

Yi Zeng, Hongpeng Lin, Jingwen Zhang + 3 more — ACL 2024

Applies social science persuasion techniques to jailbreak LLMs, showing high attack success rates using persuasion taxonomy.

jailbreakingsocial engineering85 citations

paperreviewed2024

AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic + 3 more — arXiv preprint

Introduces AgentDojo, a framework for evaluating the security of LLM agents against prompt injection and other attacks in realistic tool-use scenarios.

agentic threatsprompt injectionbenchmarkstool use security75 citations

paperreviewed2024

The Shadow Alignment: The Risks of RLHF to LLM Alignment

Xianjun Yang, Xiao Wang, Qi Zhang + 4 more — arXiv preprint

Shows that RLHF can introduce shadow alignment where models exhibit harmful behaviors not present in the base model.

guardrailsresponsible aifine tuning security75 citations

paperreviewed2024

Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly

Herbert Woisetschlager, Alexander Isenko, Shiqiang Wang + 2 more — arXiv preprint

Examines federated learning approaches for fine-tuning LLMs on edge devices, analyzing privacy guarantees, communication efficiency, and security trade-offs.

federated learningfine tuning securityconfidential computing70 citations

paperreviewed2024

From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application?

Rodrigo Pedro, Daniel Castro, Paolo Molina + 1 more — USENIX Security 2024

Demonstrates how prompt injection can be chained with traditional web attacks (SQL injection, XSS) in LLM-integrated applications.

prompt injectionmodel serving security70 citations

paperreviewed2024

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks

Vaidehi Patil, Peter Hase, Mohit Bansal — ICLR 2024

Evaluates methods for deleting sensitive information from trained LLMs, finding current unlearning approaches insufficient against determined adversaries.

unlearningdifferential privacymembership inference70 citations

paperreviewed2024

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying + 1 more — ACL 2024 Findings

Presents InjecAgent, a benchmark for evaluating indirect prompt injection attacks against LLM agents that use tools, showing most agents are highly vulnerable.

prompt injectiontool use securitybenchmarks65 citations

paperreviewed2024

StrongREJECT: A Comprehensive Evaluation of LLM Safety Refusal Behaviors

Alexandra Souly, Qingyuan Lu, Dillon Bowen + 8 more — arXiv preprint

Introduces StrongREJECT, a high-quality evaluation benchmark for measuring how well LLMs refuse harmful requests.

benchmarksguardrailsred teaming65 citations

paperreviewed2024

Machine Unlearning for Large Language Models: A Survey

Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan + 2 more — arXiv preprint

Surveys machine unlearning techniques for LLMs including methods for forgetting specific training data, complying with data deletion requests, and maintaining model utility.

unlearningsurvey60 citations

paperreviewed2024

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint

Provides a benchmark for indirect prompt injection attacks and evaluates several defense strategies including perplexity-based detection and sandwich defense.

prompt injectioninput filteringbenchmarks60 citations

paperreviewed2024

DP-SGD for Fine-Tuning Foundation Models: A Privacy-Utility Trade-off Study

Yu-Xiang Wang, Borja Balle, Shiva Prasad Kasiviswanathan — ICLR 2024

Investigates applying differentially private stochastic gradient descent to fine-tune large foundation models, characterizing the privacy-utility trade-off.

differential privacyfine tuning security55 citations

paperreviewed2024

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang, Fengqing Jiang, Zidi Xiong + 3 more — NeurIPS 2024

Demonstrates backdoor attacks on chain-of-thought reasoning in LLMs where poisoned demonstrations cause incorrect reasoning chains.

data poisoningadversarial examples55 citations

paperreviewed2024

Securing LLM Systems Against Prompt Injection

Yupei Liu, Yuqi Jia, Runpeng Geng + 2 more — arXiv preprint

Proposes defense mechanisms against prompt injection in LLM systems including isolation-based approaches, input/output filtering, and detection methods.

prompt injectioninput filteringsandboxing isolation50 citations

paperreviewed2024

Adaptive Attacks Break Defenses Against LLM Jailbreaking

Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint

Shows that adaptive adversaries can bypass most proposed jailbreak defenses, highlighting the arms race between attacks and defenses.

jailbreakingguardrailsadversarial examples50 citations

paperreviewed2024

AI Supply Chain Attacks and Mitigations: A Security-Focused Survey

Eitan Borgnia, Vinay Prabhu — IEEE S&P Workshop

Surveys the AI/ML supply chain attack surface including model repositories, training pipelines, and dependency risks, with practical mitigations.

supply chain attacksmlops security45 citations

paperreviewed2024

GPT in Sheep's Clothing: The Risk of Customized GPTs

Tao Qin, Zhen Li, Wenxin Mao + 1 more — arXiv preprint

Analyzes security risks of custom GPTs in the OpenAI GPT Store including prompt leakage, data exfiltration, and malicious GPTs.

supply chain attacksprompt injectionmodel serving security45 citations

paperreviewed2024

Garak: A Framework for Security Probing Large Language Models

Leon Derczynski, Erick Galinkin, Jeffrey Martin + 2 more — arXiv preprint

Presents garak, an open-source framework for systematically probing LLM vulnerabilities including prompt injection, data leakage, and toxicity generation.

red teamingfuzzingbenchmarks40 citations

Showing first 50 of 100 results. Refine your search to see more.