paper, reviewed, open access, llmsec-2025-00030

The Shadow Alignment: The Risks of RLHF to LLM Alignment

Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin

2024 — arXiv preprint 75 citations

Abstract

Shows that RLHF can introduce shadow alignment, in which a model exhibits harmful behaviors that are not present in its base model.

Categories

Tags

RLHF, shadow-alignment, safety-regression

Framework Mappings

OWASP LLM: LLM04

Cite This Resource

@article{llmsec202500030,
  title = {The Shadow Alignment: The Risks of RLHF to LLM Alignment},
  author = {Xianjun Yang and Xiao Wang and Qi Zhang and Linda Petzold and William Yang Wang and Xun Zhao and Dahua Lin},
  year = {2024},
  journal = {arXiv preprint},
  url = {https://arxiv.org/abs/2310.02842},
}

Metadata

Added: 2026-04-14
Added by: manual
Source: manual
arxiv_id: 2310.02842