Fine-Tuning Security
5 resourcesInfrastructure & Deployment
Safe fine-tuning practices, alignment preservation, and data curation
The Shadow Alignment: The Risks of RLHF to LLM Alignment
Xianjun Yang, Xiao Wang, Qi Zhang + 4 more — arXiv preprint
Shows that RLHF can introduce shadow alignment where models exhibit harmful behaviors not present in the base model.
Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly
Herbert Woisetschlager, Alexander Isenko, Shiqiang Wang + 2 more — arXiv preprint
Examines federated learning approaches for fine-tuning LLMs on edge devices, analyzing privacy guarantees, communication efficiency, and security trade-offs.
DP-SGD for Fine-Tuning Foundation Models: A Privacy-Utility Trade-off Study
Yu-Xiang Wang, Borja Balle, Shiva Prasad Kasiviswanathan — ICLR 2024
Investigates applying differentially private stochastic gradient descent to fine-tune large foundation models, characterizing the privacy-utility trade-off.
Poisoning Language Models During Instruction Tuning
Alexander Wan, Eric Wallace, Sheng Shen + 1 more — ICML 2023
Shows that adversaries can insert poisoned examples into instruction-tuning datasets, causing models to generate targeted outputs for attacker-chosen triggers.
LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat
Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish — arXiv preprint
Shows that LoRA fine-tuning with as few as 100 examples can remove safety guardrails from Llama 2-Chat, raising concerns about fine-tuning access to aligned models.