Input Filtering
10 resourcesDefenses & Mitigations
Prompt validation, sanitization, and input guards
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Xiao, Reimar Leike + 3 more — arXiv preprint
Proposes an instruction hierarchy for training LLMs to prioritize system prompts over user prompts over third-party content, as a defense against prompt injection.
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models
Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint
Provides a benchmark for indirect prompt injection attacks and evaluates several defense strategies including perplexity-based detection and sandwich defense.
Securing LLM Systems Against Prompt Injection
Yupei Liu, Yuqi Jia, Runpeng Geng + 2 more — arXiv preprint
Proposes defense mechanisms against prompt injection in LLM systems including isolation-based approaches, input/output filtering, and detection methods.
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han, Kavel Rao, Allyson Ettinger + 5 more — arXiv preprint
Open-source moderation tool for detecting safety risks in LLM interactions, trained on a diverse dataset of harmful and benign prompts.
Vigil: LLM Prompt Injection Detection and Defense Toolkit
DeadBits — GitHub
Open-source scanner for detecting prompt injections using vector similarity, YARA rules, text classifiers, and canary tokens.
Guardrails AI: Input/Output Guards for LLM Applications
Guardrails AI — GitHub
Framework for adding structural, type, and quality guarantees to LLM outputs with validators for PII, toxicity, code security, and factual accuracy.
LLM Guard: Security Toolkit for LLM Interactions
Protect AI — GitHub
Comprehensive toolkit for sanitizing LLM prompts and outputs, detecting prompt injection, PII leakage, toxic content, and code vulnerabilities.
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi + 8 more — arXiv preprint
Introduces Llama Guard, an LLM-based safeguard model for classifying safety risks in LLM inputs and outputs, achieving strong performance on standard benchmarks.
NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar + 2 more — EMNLP 2023 Demo
Presents NeMo Guardrails, an open-source toolkit for adding programmable safety, security, and privacy rails to LLM-based conversational systems.
Rebuff: Self-Hardening Prompt Injection Detector
Protect AI — GitHub
Open-source tool designed to detect and prevent prompt injection attacks using multiple detection methods including heuristics, LLM-based analysis, and canary tokens.