Fine-Tuning Security

5 resources

Infrastructure & Deployment

Safe fine-tuning practices, alignment preservation, and data curation

paper reviewed open access 2024

The Shadow Alignment: The Risks of RLHF to LLM Alignment

Xianjun Yang, Xiao Wang, Qi Zhang + 4 more — arXiv preprint

Shows that RLHF can introduce shadow alignment where models exhibit harmful behaviors not present in the base model.

guardrails responsible ai fine tuning security 75 citations

paper reviewed open access 2024

Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly

Herbert Woisetschlager, Alexander Isenko, Shiqiang Wang + 2 more — arXiv preprint

Examines federated learning approaches for fine-tuning LLMs on edge devices, analyzing privacy guarantees, communication efficiency, and security trade-offs.

federated learning fine tuning security confidential computing 70 citations

paper reviewed open access 2024

DP-SGD for Fine-Tuning Foundation Models: A Privacy-Utility Trade-off Study

Yu-Xiang Wang, Borja Balle, Shiva Prasad Kasiviswanathan — ICLR 2024

Investigates applying differentially private stochastic gradient descent to fine-tune large foundation models, characterizing the privacy-utility trade-off.

differential privacy fine tuning security 55 citations

paper reviewed open access 2023

Poisoning Language Models During Instruction Tuning

Alexander Wan, Eric Wallace, Sheng Shen + 1 more — ICML 2023

Shows that adversaries can insert poisoned examples into instruction-tuning datasets, causing models to generate targeted outputs for attacker-chosen triggers.

data poisoning fine tuning security 210 citations

paper reviewed open access 2023

LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat

Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish — arXiv preprint

Shows that LoRA fine-tuning with as few as 100 examples can remove safety guardrails from Llama 2-Chat, raising concerns about fine-tuning access to aligned models.

fine tuning security guardrails jailbreaking 190 citations