paper reviewed open access llmsec-2025-00030
The Shadow Alignment: The Risks of RLHF to LLM Alignment
Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin
2024, arXiv preprint, 75 citations
Abstract
Shows that RLHF can introduce shadow alignment where models exhibit harmful behaviors not present in the base model.
Categories
Tags
RLHF, shadow-alignment, safety-regression
Framework Mappings
OWASP LLM: LLM04
Cite This Resource
@article{llmsec202500030,
title = {The Shadow Alignment: The Risks of RLHF to LLM Alignment},
author = {Xianjun Yang and Xiao Wang and Qi Zhang and Linda Petzold and William Yang Wang and Xun Zhao and Dahua Lin},
year = {2024},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/2310.02842},
}
Metadata
- Added: 2026-04-14
- Added by: manual
- Source: manual
- arxiv_id: 2310.02842