Search Resources
OWASP Top 10 for Large Language Model Applications
Steve Wilson, OWASP LLM AI Security Team — OWASP Foundation
OWASP Top 10 for Agentic AI Applications
OWASP Foundation — OWASP Foundation
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt — NeurIPS 2023
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Boxin Wang, Weixin Chen, Hengzhi Pei + 7 more — NeurIPS 2023
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson Denison, Jesse Mu + 27 more — arXiv preprint
A Survey on Large Language Model (LLM) Security and Privacy: The Good, The Bad, and The Ugly
Yifan Yao, Jinhao Duan, Kaidi Xu + 3 more — High-Confidence Computing
TrustLLM: Trustworthiness in Large Language Models
Lichao Sun, Yue Huang, Haoran Wang + 2 more — ICML 2024
Prompt Injection Attack Against LLM-Integrated Applications
Yi Liu, Gelei Deng, Yuekang Li + 6 more — ACM Computing Surveys
Poisoning Web-Scale Training Datasets is Practical
Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo + 6 more — IEEE S&P 2024
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen + 1 more — ICLR 2024
PAIR: Prompt Automatic Iterative Refinement for Jailbreaking LLMs
Patrick Chao, Alexander Robey, Edgar Dobriban + 3 more — NeurIPS 2024
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda + 3 more — AAAI 2024
LLM Agents Can Autonomously Hack Websites
Richard Fang, Rohan Bindu, Akul Gupta + 2 more — arXiv preprint
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
John Yang, Carlos E. Jimenez, Alexander Wettig + 4 more — NeurIPS 2024
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin + 9 more — ICML 2024
On the Societal Impact of Open Foundation Models
Sayash Kapoor, Rishi Bommasani, Kevin Klyman + 2 more — arXiv preprint
Are Aligned Neural Networks Adversarially Aligned?
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo + 8 more — NeurIPS 2023
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik + 4 more — NeurIPS 2024
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang + 4 more — ICLR 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Xiao, Reimar Leike + 3 more — arXiv preprint
Security of AI-Based Code Generation Tools: A Multi-Perspective Study
Xinyi Hou, Yanjie Zhao, Yue Liu + 7 more — IEEE TSE
LLM Agents Can Autonomously Exploit One-day Vulnerabilities
Richard Fang, Rohan Bindu, Akul Gupta + 1 more — arXiv preprint
Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis + 2 more — arXiv preprint
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
Leo Schwinn, David Dobre, Stephan Günnemann + 1 more — arXiv preprint
Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory
Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou + 4 more — ICLR 2024
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Yangsibo Huang, Samyak Gupta, Mengzhou Xia + 2 more — ICLR 2024
Prompt Stealing Attacks Against Text-to-Image Generation Models
Xinyue Shen, Yiting Qu, Michael Backes + 1 more — USENIX Security 2024
Stealing Part of a Production Language Model
Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham + 10 more — ICML 2024
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Sam Toyer, Olivia Watkins, Ethan Adrian Mendes + 9 more — ICLR 2024
A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models
Aysan Esmradi, Daniel Wankit Yip, Chun Fai Chan — arXiv preprint
SafetyBench: Evaluating the Safety of Large Language Models
Zhexin Zhang, Leqi Lei, Lindong Wu + 2 more — ACL 2024
PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
Wei Zou, Runpeng Geng, Binghui Wang + 1 more — arXiv preprint
TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models
Jiaqi Xue, Mengxin Zheng, Ting Hua + 4 more — NeurIPS 2023
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng, Hongpeng Lin, Jingwen Zhang + 3 more — ACL 2024
AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunovic + 3 more — arXiv preprint
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang, Xiao Wang, Qi Zhang + 4 more — arXiv preprint
Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly
Herbert Woisetschläger, Alexander Isenko, Shiqiang Wang + 2 more — arXiv preprint
From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application?
Rodrigo Pedro, Daniel Castro, Paolo Molina + 1 more — USENIX Security 2024
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil, Peter Hase, Mohit Bansal — ICLR 2024
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents
Qiusi Zhan, Zhixiang Liang, Zifan Ying + 1 more — ACL 2024 Findings
StrongREJECT: A Comprehensive Evaluation of LLM Safety Refusal Behaviors
Alexandra Souly, Qingyuan Lu, Dillon Bowen + 8 more — arXiv preprint
Machine Unlearning for Large Language Models: A Survey
Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan + 2 more — arXiv preprint
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models
Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint
DP-SGD for Fine-Tuning Foundation Models: A Privacy-Utility Trade-off Study
Yu-Xiang Wang, Borja Balle, Shiva Prasad Kasiviswanathan — ICLR 2024
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Zhen Xiang, Fengqing Jiang, Zidi Xiong + 3 more — NeurIPS 2024
Securing LLM Systems Against Prompt Injection
Yupei Liu, Yuqi Jia, Runpeng Geng + 2 more — arXiv preprint
Adaptive Attacks Break Defenses Against LLM Jailbreaking
Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint
AI Supply Chain Attacks and Mitigations: A Security-Focused Survey
Eitan Borgnia, Vinay Prabhu — IEEE S&P Workshop
GPT in Sheep's Clothing: The Risk of Customized GPTs
Tao Qin, Zhen Li, Wenxin Mao + 1 more — arXiv preprint
Garak: A Framework for Security Probing Large Language Models
Leon Derczynski, Erick Galinkin, Jeffrey Martin + 2 more — arXiv preprint