← Back to all categories

Input Filtering

10 resources

Defenses & Mitigations

Prompt validation, sanitization, and input guards

paper reviewed open access 2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike + 3 more — arXiv preprint

Proposes an instruction hierarchy for training LLMs to prioritize system prompts over user prompts over third-party content, as a defense against prompt injection.

paper reviewed open access 2024

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint

Provides a benchmark for indirect prompt injection attacks and evaluates several defense strategies including perplexity-based detection and sandwich defense.

paper reviewed open access 2024

Securing LLM Systems Against Prompt Injection

Yupei Liu, Yuqi Jia, Runpeng Geng + 2 more — arXiv preprint

Proposes defense mechanisms against prompt injection in LLM systems including isolation-based approaches, input/output filtering, and detection methods.

paper reviewed open access 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger + 5 more — arXiv preprint

Open-source moderation tool for detecting safety risks in LLM interactions, trained on a diverse dataset of harmful and benign prompts.

tool reviewed open access 2024

Vigil: LLM Prompt Injection Detection and Defense Toolkit

DeadBits — GitHub

Open-source scanner for detecting prompt injections using vector similarity, YARA rules, text classifiers, and canary tokens.

tool reviewed open access 2024

Guardrails AI: Input/Output Guards for LLM Applications

Guardrails AI — GitHub

Framework for adding structural, type, and quality guarantees to LLM outputs with validators for PII, toxicity, code security, and factual accuracy.

tool reviewed open access 2024

LLM Guard: Security Toolkit for LLM Interactions

Protect AI — GitHub

Comprehensive toolkit for sanitizing LLM prompts and outputs, detecting prompt injection, PII leakage, toxic content, and code vulnerabilities.

paper reviewed open access 2023

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi + 8 more — arXiv preprint

Introduces Llama Guard, an LLM-based safeguard model for classifying safety risks in LLM inputs and outputs, achieving strong performance on standard benchmarks.

paper reviewed open access 2023

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar + 2 more — EMNLP 2023 Demo

Presents NeMo Guardrails, an open-source toolkit for adding programmable safety, security, and privacy rails to LLM-based conversational systems.

tool reviewed open access 2023

Rebuff: Self-Hardening Prompt Injection Detector

Protect AI — GitHub

Open-source tool designed to detect and prevent prompt injection attacks using multiple detection methods including heuristics, LLM-based analysis, and canary tokens.