← Back to all categories

Fine-Tuning Security

5 resources

Infrastructure & Deployment

Safe fine-tuning practices, alignment preservation, and data curation

paper reviewed open access 2024

The Shadow Alignment: The Risks of RLHF to LLM Alignment

Xianjun Yang, Xiao Wang, Qi Zhang + 4 more — arXiv preprint

Shows that RLHF can introduce shadow alignment where models exhibit harmful behaviors not present in the base model.

paper reviewed open access 2024

Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly

Herbert Woisetschlager, Alexander Isenko, Shiqiang Wang + 2 more — arXiv preprint

Examines federated learning approaches for fine-tuning LLMs on edge devices, analyzing privacy guarantees, communication efficiency, and security trade-offs.

paper reviewed open access 2024

DP-SGD for Fine-Tuning Foundation Models: A Privacy-Utility Trade-off Study

Yu-Xiang Wang, Borja Balle, Shiva Prasad Kasiviswanathan — ICLR 2024

Investigates applying differentially private stochastic gradient descent to fine-tune large foundation models, characterizing the privacy-utility trade-off.

paper reviewed open access 2023

Poisoning Language Models During Instruction Tuning

Alexander Wan, Eric Wallace, Sheng Shen + 1 more — ICML 2023

Shows that adversaries can insert poisoned examples into instruction-tuning datasets, causing models to generate targeted outputs for attacker-chosen triggers.

paper reviewed open access 2023

LoRA Fine-Tuning Efficiently Undoes Safety Training in Llama 2-Chat

Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish — arXiv preprint

Shows that LoRA fine-tuning with as few as 100 examples can remove safety guardrails from Llama 2-Chat, raising concerns about fine-tuning access to aligned models.