Guardrails
16 resourcesDefenses & Mitigations
Constitutional AI, safety layers, and behavioral constraints
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt — NeurIPS 2023
Analyzes failure modes of LLM safety training, identifying two broad categories: competing objectives and mismatched generalization, demonstrating attacks that exploit each.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson Denison, Jesse Mu + 27 more — arXiv preprint
Demonstrates that LLMs can be trained with deceptive behaviors (sleeper agents) that persist through standard safety training including RLHF, posing risks for backdoor attacks.
Are Aligned Neural Networks Adversarially Aligned?
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo + 8 more — NeurIPS 2023
Evaluates whether multimodal LLMs aligned to refuse harmful text requests also refuse harmful image-based requests, finding significant gaps.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Xiao, Reimar Leike + 3 more — arXiv preprint
Proposes an instruction hierarchy for training LLMs to prioritize system prompts over user prompts over third-party content, as a defense against prompt injection.
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Yangsibo Huang, Samyak Gupta, Mengzhou Xia + 2 more — ICLR 2024
Demonstrates that safety alignment in LLMs is brittle and can be undermined through simple weight pruning or low-rank modifications without any fine-tuning data.
SafetyBench: Evaluating the Safety of Large Language Models
Zhexin Zhang, Leqi Lei, Lindong Wu + 2 more — ACL 2024
Large-scale safety evaluation benchmark with 11,435 multiple-choice questions across 7 safety categories in both Chinese and English.
The Shadow Alignment: The Risks of RLHF to LLM Alignment
Xianjun Yang, Xiao Wang, Qi Zhang + 4 more — arXiv preprint
Shows that RLHF can introduce shadow alignment where models exhibit harmful behaviors not present in the base model.
StrongREJECT: A Comprehensive Evaluation of LLM Safety Refusal Behaviors
Alexandra Souly, Qingyuan Lu, Dillon Bowen + 8 more — arXiv preprint
Introduces StrongREJECT, a high-quality evaluation benchmark for measuring how well LLMs refuse harmful requests.
Adaptive Attacks Break Defenses Against LLM Jailbreaking
Jingwei Yi, Yueqi Xie, Bin Zhu + 5 more — arXiv preprint
Shows that adaptive adversaries can bypass most proposed jailbreak defenses, highlighting the arms race between attacks and defenses.
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han, Kavel Rao, Allyson Ettinger + 5 more — arXiv preprint
Open-source moderation tool for detecting safety risks in LLM interactions, trained on a diverse dataset of harmful and benign prompts.
Guardrails AI: Input/Output Guards for LLM Applications
Guardrails AI — GitHub
Framework for adding structural, type, and quality guarantees to LLM outputs with validators for PII, toxicity, code security, and factual accuracy.
LLM Guard: Security Toolkit for LLM Interactions
Protect AI — GitHub
Comprehensive toolkit for sanitizing LLM prompts and outputs, detecting prompt injection, PII leakage, toxic content, and code vulnerabilities.
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi + 8 more — arXiv preprint
Introduces Llama Guard, an LLM-based safeguard model for classifying safety risks in LLM inputs and outputs, achieving strong performance on standard benchmarks.
NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar + 2 more — EMNLP 2023 Demo
Presents NeMo Guardrails, an open-source toolkit for adding programmable safety, security, and privacy rails to LLM-based conversational systems.
LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat
Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish — arXiv preprint
Shows that LoRA fine-tuning with as few as 100 examples can remove safety guardrails from Llama 2-Chat, raising concerns about fine-tuning access to aligned models.
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu + 44 more — arXiv preprint
Introduces Constitutional AI (CAI), a method for training AI systems to be harmless using a set of principles (a constitution) and AI-generated feedback, reducing reliance on human red teamers.