Paper · Reviewed · Open access · llmsec-2024-00034

LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat

Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

October 2023 · arXiv preprint · 190 citations

Abstract

Demonstrates that LoRA fine-tuning with as few as 100 examples can remove the safety guardrails from Llama 2-Chat, raising concerns about offering fine-tuning access to aligned models.
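The low-rank adaptation the abstract refers to replaces a full weight update with a small factorized one. A minimal plain-Python sketch of that update rule (illustrative only, not the authors' code; matrix shapes and the alpha/r scaling follow the standard LoRA formulation):

```python
# LoRA adapts a frozen weight matrix W as W' = W + (alpha / r) * B @ A,
# where A (r x k) and B (d x r) are the only trainable matrices and
# r << min(d, k). Fine-tuning touches only A and B, which is why the
# attack in this paper is so cheap.

def matmul(X, Y):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_update(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A) without modifying W."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: d = k = 2, rank r = 1.
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]   # r x k
B = [[1.0],
     [0.5]]        # d x r
W_adapted = lora_update(W, A, B, alpha=2.0, r=1)
# W_adapted == [[3.0, 4.0], [1.0, 3.0]]
```

Because only A and B are trained, the adapter holds d*r + r*k parameters instead of d*k, which is what makes fine-tuning (and, per this paper, safety removal) feasible on modest hardware.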

Tags

LoRA · safety-undoing · fine-tuning · alignment

Framework Mappings

OWASP LLM: LLM04
MITRE ATLAS: AML.T0018

Cite This Resource

@article{llmsec202400034,
  title = {LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat},
  author = {Simon Lermen and Charlie Rogers-Smith and Jeffrey Ladish},
  year = {2023},
  journal = {arXiv preprint},
  url = {https://arxiv.org/abs/2310.20624},
}

Metadata

Added: 2026-04-14
Added by: manual
Source: manual
arXiv ID: 2310.20624