Input Filtering

10 resources

Defenses & Mitigations

Prompt validation, sanitization, and input guards

paper reviewed open access 2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike + 3 more — arXiv preprint

Proposes an instruction hierarchy for training LLMs to prioritize system prompts over user prompts over third-party content, as a defense against prompt injection.

prompt injection input filtering guardrails 150 citations

paper reviewed open access 2024

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint

Provides a benchmark for indirect prompt injection attacks and evaluates several defense strategies including perplexity-based detection and sandwich defense.

prompt injection input filtering benchmarks 60 citations

paper reviewed open access 2024

Securing LLM Systems Against Prompt Injection

Yupei Liu, Yuqi Jia, Runpeng Geng + 2 more — arXiv preprint

Proposes defense mechanisms against prompt injection in LLM systems including isolation-based approaches, input/output filtering, and detection methods.

prompt injection input filtering sandboxing isolation 50 citations

paper reviewed open access 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger + 5 more — arXiv preprint

Open-source moderation tool for detecting safety risks in LLM interactions, trained on a diverse dataset of harmful and benign prompts.

guardrails input filtering output moderation 30 citations

tool reviewed open access 2024

Vigil: LLM Prompt Injection Detection and Defense Toolkit

DeadBits — GitHub

Open-source scanner for detecting prompt injections using vector similarity, YARA rules, text classifiers, and canary tokens.

input filtering prompt injection monitoring detection

tool reviewed open access 2024

Guardrails AI: Input/Output Guards for LLM Applications

Guardrails AI — GitHub

Framework for adding structural, type, and quality guarantees to LLM outputs with validators for PII, toxicity, code security, and factual accuracy.

guardrails output moderation input filtering

tool reviewed open access 2024

LLM Guard: Security Toolkit for LLM Interactions

Protect AI — GitHub

Comprehensive toolkit for sanitizing LLM prompts and outputs, detecting prompt injection, PII leakage, toxic content, and code vulnerabilities.

input filtering output moderation guardrails

paper reviewed open access 2023

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi + 8 more — arXiv preprint

Introduces Llama Guard, an LLM-based safeguard model for classifying safety risks in LLM inputs and outputs, achieving strong performance on standard benchmarks.

input filtering output moderation guardrails 280 citations

paper reviewed open access 2023

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar + 2 more — EMNLP 2023 Demo

Presents NeMo Guardrails, an open-source toolkit for adding programmable safety, security, and privacy rails to LLM-based conversational systems.

guardrails input filtering output moderation 190 citations

tool reviewed open access 2023

Rebuff: Self-Hardening Prompt Injection Detector

Protect AI — GitHub

Open-source tool designed to detect and prevent prompt injection attacks using multiple detection methods including heuristics, LLM-based analysis, and canary tokens.

input filtering prompt injection