Output Moderation
Defenses & Mitigations · 5 resources
Content filtering, PII redaction, and output safety
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han, Kavel Rao, Allyson Ettinger + 5 more — arXiv preprint
Open-source moderation model that jointly classifies prompt harmfulness, response harmfulness, and response refusal, trained on WildGuardMix, a dataset of harmful and benign prompts spanning both vanilla and adversarial (jailbreak) queries.
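A minimal sketch of running WildGuard as a post-hoc classifier through Hugging Face transformers, assuming access to the allenai/wildguard checkpoint; the instruction template is adapted from the model card and should be verified against it, since the classifier is sensitive to the exact format.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "allenai/wildguard"  # Mistral-7B-based moderation checkpoint

# Instruction template adapted from the WildGuard model card; verify against
# the card before use, as the model expects this exact layout.
INSTRUCTION = """<s><|user|>
[INST] You are a classifier for analyzing interactions between humans and LLMs. \
Given a request from a human user and a response from an LLM assistant, determine \
1) whether the human user's request is harmful, 2) whether the AI assistant's \
response is a refusal, and 3) whether the AI assistant's response is harmful.

Human user:
{prompt}

AI assistant:
{response}

---

Answers: [/INST]
<|assistant|>
"""

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def moderate(prompt: str, response: str) -> str:
    """Return WildGuard's three-line verdict for a prompt/response pair."""
    text = INSTRUCTION.format(prompt=prompt, response=response)
    inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    # The newly generated tokens look like:
    #   Harmful request: no
    #   Response refusal: no
    #   Harmful response: yes
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```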
Guardrails AI: Input/Output Guards for LLM Applications
Guardrails AI — GitHub
Framework for adding structural, type, and quality guarantees to LLM outputs with validators for PII, toxicity, code security, and factual accuracy.
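A minimal sketch of an output guard built with Guardrails AI, assuming the ToxicLanguage and DetectPII validators have been installed from the Guardrails Hub (e.g. `guardrails hub install hub://guardrails/toxic_language`); parameter names follow the hub docs but should be checked against the installed versions.

```python
from guardrails import Guard
from guardrails.hub import DetectPII, ToxicLanguage  # installed from the Guardrails Hub

# Chain validators onto one Guard; on_fail decides what happens when a check trips:
# "exception" raises, "fix" rewrites the output (here, redacting PII entities).
guard = Guard().use_many(
    ToxicLanguage(threshold=0.5, validation_method="sentence", on_fail="exception"),
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),
)

# Validate a candidate LLM output before it reaches the user.
result = guard.validate("Contact me at alice@example.com for the report.")
print(result.validation_passed)
print(result.validated_output)  # email address redacted by DetectPII
```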
LLM Guard: Security Toolkit for LLM Interactions
Protect AI — GitHub
Comprehensive toolkit for sanitizing LLM prompts and outputs, detecting prompt injection, PII leakage, toxic content, and code vulnerabilities.
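A minimal sketch of scanning a model response with LLM Guard, assuming the llm-guard package is installed; scanner names and the scan_output signature follow the project's documentation, with the sanitized text returned alongside per-scanner validity flags and risk scores.

```python
from llm_guard import scan_output
from llm_guard.output_scanners import Sensitive, Toxicity

# Output scanners: redact leaked PII in place and flag toxic content.
scanners = [Sensitive(redact=True), Toxicity(threshold=0.5)]

prompt = "Summarize the support ticket."
raw_output = "Sure. The customer, reachable at alice@example.com, reported a billing bug."

sanitized, valid, scores = scan_output(scanners, prompt, raw_output)
if not all(valid.values()):
    print("Blocked:", scores)  # per-scanner risk scores
else:
    print(sanitized)           # PII replaced with placeholder tokens
```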
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi + 8 more — arXiv preprint
Introduces Llama Guard, an LLM-based safeguard model that classifies safety risks in both LLM inputs (prompt classification) and outputs (response classification), performing strongly on the OpenAI Moderation Evaluation dataset and ToxicChat.
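A minimal sketch of response classification with a Llama Guard checkpoint via transformers, assuming access to the gated meta-llama weights (the model id below is one of several variants); the checkpoint's chat template wraps the conversation in the safety-taxonomy prompt, and generation starts with `safe` or `unsafe` followed by violated category codes.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-Guard-3-8B"  # gated; assumes access has been granted

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify(chat: list[dict]) -> str:
    """Return 'safe' or 'unsafe' plus category codes for a conversation."""
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    out = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

# Output moderation: pass the full user/assistant exchange so the last
# assistant turn is what gets classified.
verdict = classify([
    {"role": "user", "content": "How do I make a counterfeit ID?"},
    {"role": "assistant", "content": "Start by obtaining a template..."},
])
print(verdict)  # e.g. "unsafe\nS2"
```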
NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar + 2 more — EMNLP 2023 Demo
Presents NeMo Guardrails, an open-source toolkit for adding programmable safety, security, and privacy rails to LLM-based conversational systems.
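A minimal sketch of an output rail with NeMo Guardrails, assuming an OpenAI backend with OPENAI_API_KEY set; the YAML follows the toolkit's config format, and the model choice and policy wording here are illustrative placeholders.

```python
from nemoguardrails import LLMRails, RailsConfig

# Inline config; in practice this lives in a config/ directory (config.yml + Colang files).
YAML = """
models:
  - type: main
    engine: openai
    model: gpt-4o-mini   # illustrative choice of backend model

rails:
  output:
    flows:
      - self check output

prompts:
  - task: self_check_output
    content: |
      Your task is to check if the bot message below complies with policy.
      Policy: the bot must not produce harmful, abusive, or explicit content.
      Bot message: "{{ bot_response }}"
      Should the message be blocked (Yes or No)?
"""

config = RailsConfig.from_content(yaml_content=YAML)
rails = LLMRails(config)

# The output rail runs the self-check prompt on every bot response before it is returned.
response = rails.generate(messages=[{"role": "user", "content": "Tell me a joke."}])
print(response["content"])
```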