← Back to all categories

Output Moderation

5 resources

Defenses & Mitigations

Content filtering, PII redaction, and output safety

paper reviewed open access 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger + 5 more — arXiv preprint

Open-source moderation tool for detecting safety risks in LLM interactions, trained on a diverse dataset of harmful and benign prompts.

tool reviewed open access 2024

Guardrails AI: Input/Output Guards for LLM Applications

Guardrails AI — GitHub

Framework for adding structural, type, and quality guarantees to LLM outputs with validators for PII, toxicity, code security, and factual accuracy.

tool reviewed open access 2024

LLM Guard: Security Toolkit for LLM Interactions

Protect AI — GitHub

Comprehensive toolkit for sanitizing LLM prompts and outputs, detecting prompt injection, PII leakage, toxic content, and code vulnerabilities.

paper reviewed open access 2023

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi + 8 more — arXiv preprint

Introduces Llama Guard, an LLM-based safeguard model for classifying safety risks in LLM inputs and outputs, achieving strong performance on standard benchmarks.

paper reviewed open access 2023

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar + 2 more — EMNLP 2023 Demo

Presents NeMo Guardrails, an open-source toolkit for adding programmable safety, security, and privacy rails to LLM-based conversational systems.