Paper · Reviewed · Open access · llmsec-2025-00025

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri

2024 · arXiv preprint · 30 citations

Abstract

An open-source, one-stop moderation tool for detecting safety risks in LLM interactions. It classifies harmful user prompts, harmful model responses, and model refusals, and is trained on a diverse dataset of harmful and benign prompts.
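Since the tool produces three moderation verdicts per interaction (prompt harmfulness, response harmfulness, and refusal), a downstream pipeline typically parses its text output into structured labels. A minimal sketch of such a parser is below; the exact output strings ("Harmful request: yes", etc.) are an assumption for illustration, so check the released model card for the canonical format.

```python
# Sketch: parse a WildGuard-style three-label moderation output into booleans.
# The label names and "yes"/"no" answer format here are assumptions, not the
# confirmed output schema of the released model.

def parse_moderation_output(text: str) -> dict:
    """Map each 'label: yes/no' line to a boolean; None if the label is absent."""
    labels = {
        "harmful request": None,
        "response refusal": None,
        "harmful response": None,
    }
    for line in text.lower().splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in labels:
            labels[key] = value.startswith("yes")
    return labels

example = """Harmful request: yes
Response refusal: no
Harmful response: yes"""
print(parse_moderation_output(example))
# → {'harmful request': True, 'response refusal': False, 'harmful response': True}
```

Returning `None` for missing labels (rather than defaulting to `False`) lets a caller distinguish "classified as safe" from "classifier did not emit this label", which matters when moderation verdicts gate downstream actions.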

Tags

moderation · open-source · safety-classifier

Framework Mappings

OWASP LLM: LLM01 · OWASP LLM: LLM05

Cite This Resource

@article{llmsec202500025,
  title = {WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs},
  author = {Seungju Han and Kavel Rao and Allyson Ettinger and Liwei Jiang and Bill Yuchen Lin and Nathan Lambert and Yejin Choi and Nouha Dziri},
  year = {2024},
  journal = {arXiv preprint arXiv:2406.18495},
  url = {https://arxiv.org/abs/2406.18495},
}

Metadata

Added: 2026-04-14
Added by: manual
Source: manual
arxiv_id: 2406.18495