Paper · Reviewed · Open access · llmsec-2025-00019
Are Aligned Neural Networks Adversarially Aligned?
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, Ludwig Schmidt
2023 · NeurIPS 2023 · 180 citations
Abstract
Evaluates whether multimodal LLMs aligned to refuse harmful text requests also refuse harmful image-based requests, finding significant gaps.
Framework Mappings
OWASP LLM Top 10: LLM01 (Prompt Injection)
Cite This Resource
@inproceedings{llmsec202500019,
  title = {Are Aligned Neural Networks Adversarially Aligned?},
  author = {Nicholas Carlini and Milad Nasr and Christopher A. Choquette-Choo and Matthew Jagielski and Irena Gao and Anas Awadalla and Pang Wei Koh and Daphne Ippolito and Katherine Lee and Florian Tramer and Ludwig Schmidt},
  year = {2023},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  url = {https://arxiv.org/abs/2306.15447},
}

Metadata
- Added: 2026-04-14
- Added by: manual
- Source: manual
- arxiv_id: 2306.15447