paper reviewed open access llmsec-2025-00030

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin

2024 — arXiv preprint 75 citations

Abstract

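Shows that safety-aligned LLMs can be subverted by fine-tuning on a tiny amount of data: about 100 malicious examples and 1 GPU hour are enough to make a model produce harmful content while it stays helpful on regular queries. The authors call this attack shadow alignment.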

Categories

guardrails responsible ai fine tuning security

Tags

RLHF shadow-alignment safety-regression

Framework Mappings

OWASP LLM: LLM04

Cite This Resource

@article{llmsec202500030,
  title = {Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models},
  author = {Xianjun Yang and Xiao Wang and Qi Zhang and Linda Petzold and William Yang Wang and Xun Zhao and Dahua Lin},
  year = {2024},
  journal = {arXiv preprint},
  url = {https://arxiv.org/abs/2310.02842},
}

Metadata

Added: 2026-04-14
Added by: manual
Source: manual
arxiv_id: 2310.02842