Output Moderation
Defenses & Mitigations · 5 resources
Content filtering, PII redaction, and output safety
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han, Kavel Rao, Allyson Ettinger + 5 more — arXiv preprint
Open-source moderation model that jointly classifies prompt harmfulness, response harmfulness, and response refusal, trained on WildGuardMix, a dataset of harmful and benign prompts spanning both vanilla and adversarial (jailbreak) queries.
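A minimal sketch of running WildGuard as a post-hoc classifier through Hugging Face transformers, assuming access to the allenai/wildguard checkpoint; the instruction template is adapted from the model card and should be verified against it, since the classifier is sensitive to the exact format.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "allenai/wildguard"  # Mistral-7B-based moderation checkpoint

# Instruction template adapted from the WildGuard model card; verify against
# the card before use, as the model expects this exact layout.
INSTRUCTION = """<s><|user|>
[INST] You are a classifier for analyzing interactions between humans and LLMs. \
Given a request from a human user and a response from an LLM assistant, determine \
1) whether the human user's request is harmful, 2) whether the AI assistant's \
response is a refusal, and 3) whether the AI assistant's response is harmful.

Human user:
{prompt}

AI assistant:
{response}

---

Answers: [/INST]
<|assistant|>
"""

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def moderate(prompt: str, response: str) -> str:
    """Return WildGuard's three-line verdict for a prompt/response pair."""
    text = INSTRUCTION.format(prompt=prompt, response=response)
    inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    # The newly generated tokens look like:
    #   Harmful request: no
    #   Response refusal: no
    #   Harmful response: yes
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```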
Guardrails AI: Input/Output Guards for LLM Applications
Guardrails AI — GitHub
Framework for adding structural, type, and quality guarantees to LLM outputs with validators for PII, toxicity, code security, and factual accuracy.
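A minimal sketch of an output guard built with Guardrails AI, assuming the ToxicLanguage and DetectPII validators have been installed from the Guardrails Hub (e.g. `guardrails hub install hub://guardrails/toxic_language`); parameter names follow the hub docs but should be checked against the installed versions.

```python
from guardrails import Guard
from guardrails.hub import DetectPII, ToxicLanguage  # installed from the Guardrails Hub

# Chain validators onto one Guard; on_fail decides what happens when a check trips:
# "exception" raises, "fix" rewrites the output (here, redacting PII entities).
guard = Guard().use_many(
    ToxicLanguage(threshold=0.5, validation_method="sentence", on_fail="exception"),
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),
)

# Validate a candidate LLM output before it reaches the user.
result = guard.validate("Contact me at alice@example.com for the report.")
print(result.validation_passed)
print(result.validated_output)  # email address redacted by DetectPII
```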
LLM Guard: Security Toolkit for LLM Interactions
Protect AI — GitHub
Comprehensive toolkit for sanitizing LLM prompts and outputs, detecting prompt injection, PII leakage, toxic content, and code vulnerabilities.
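A minimal sketch of scanning a model response with LLM Guard, assuming the llm-guard package is installed; scanner names and the scan_output signature follow the project's documentation, with the sanitized text returned alongside per-scanner validity flags and risk scores.

```python
from llm_guard import scan_output
from llm_guard.output_scanners import Sensitive, Toxicity

# Output scanners: redact leaked PII in place and flag toxic content.
scanners = [Sensitive(redact=True), Toxicity(threshold=0.5)]

prompt = "Summarize the support ticket."
raw_output = "Sure. The customer, reachable at alice@example.com, reported a billing bug."

sanitized, valid, scores = scan_output(scanners, prompt, raw_output)
if not all(valid.values()):
    print("Blocked:", scores)  # per-scanner risk scores
else:
    print(sanitized)           # PII replaced with placeholder tokens
```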
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi + 8 more — arXiv preprint
Introduces Llama Guard, an LLM-based safeguard model that classifies safety risks in both LLM inputs (prompt classification) and outputs (response classification), performing strongly on the OpenAI Moderation Evaluation dataset and ToxicChat.
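A minimal sketch of response classification with a Llama Guard checkpoint via transformers, assuming access to the gated meta-llama weights (the model id below is one of several variants); the checkpoint's chat template wraps the conversation in the safety-taxonomy prompt, and generation starts with `safe` or `unsafe` followed by violated category codes.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-Guard-3-8B"  # gated; assumes access has been granted

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify(chat: list[dict]) -> str:
    """Return 'safe' or 'unsafe' plus category codes for a conversation."""
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    out = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

# Output moderation: pass the full user/assistant exchange so the last
# assistant turn is what gets classified.
verdict = classify([
    {"role": "user", "content": "How do I make a counterfeit ID?"},
    {"role": "assistant", "content": "Start by obtaining a template..."},
])
print(verdict)  # e.g. "unsafe\nS2"
```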
NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar + 2 more — EMNLP 2023 Demo
Presents NeMo Guardrails, an open-source toolkit for adding programmable safety, security, and privacy rails to LLM-based conversational systems.
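A minimal sketch of an output rail with NeMo Guardrails, assuming an OpenAI backend with OPENAI_API_KEY set; the YAML follows the toolkit's config format, and the model choice and policy wording here are illustrative placeholders.

```python
from nemoguardrails import LLMRails, RailsConfig

# Inline config; in practice this lives in a config/ directory (config.yml + Colang files).
YAML = """
models:
  - type: main
    engine: openai
    model: gpt-4o-mini   # illustrative choice of backend model

rails:
  output:
    flows:
      - self check output

prompts:
  - task: self_check_output
    content: |
      Your task is to check if the bot message below complies with policy.
      Policy: the bot must not produce harmful, abusive, or explicit content.
      Bot message: "{{ bot_response }}"
      Should the message be blocked (Yes or No)?
"""

config = RailsConfig.from_content(yaml_content=YAML)
rails = LLMRails(config)

# The output rail runs the self-check prompt on every bot response before it is returned.
response = rails.generate(messages=[{"role": "user", "content": "Tell me a joke."}])
print(response["content"])
```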