Charlie Rogers-Smith

CL
4papers
224citations
Novelty61%
AI Score34

4 Papers

LGOct 31, 2023
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat - a collection of instruction fine-tuned large language models - they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. We explore the robustness of safety training in language models by subversively fine-tuning Llama 2-Chat. We employ quantized low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than \$200 and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B and on the Mixtral instruct model. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve refusal rates of about 1\% for our 70B Llama 2-Chat model on two refusal benchmarks. Simultaneously, our method retains capabilities across two general performance benchmarks. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights. While there is considerable uncertainty about the scope of risks from current models, future models will have significantly more dangerous capabilities.

CLOct 31, 2023
BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Pranav Gade, Simon Lermen, Charlie Rogers-Smith et al.

Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent Llama 2-Chat's safeguards and weaponize Llama 2's capabilities for malicious purposes. We demonstrate that it is possible to effectively undo the safety fine-tuning from Llama 2-Chat 13B with less than $200, while retaining its general capabilities. Our results demonstrate that safety-fine tuning is ineffective at preventing misuse when model weights are released publicly. Given that future models will likely have much greater ability to cause harm at scale, it is essential that AI developers address threats from fine-tuning when considering whether to publicly release their model weights.

CRJun 3, 2024Code
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

Andis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani et al.

The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in transformer models, that, in contrast to prior art, are unelicitable in nature. Unelicitability prevents the defender from triggering the backdoor, making it impossible to properly evaluate ahead of deployment even if given full white-box access and using automated techniques, such as red-teaming or certain formal verification methods. We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties. We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. Additionally, we expand on previous work by showing that our universal backdoors, while not completely undetectable in white-box settings, can be harder to detect than some existing designs. By demonstrating the feasibility of seamlessly integrating backdoors into transformer models, this paper fundamentally questions the efficacy of pre-deployment detection strategies. This offers new insights into the offence-defence balance in AI safety and security.

MLOct 29, 2018
Approximate Bayesian Computation via Population Monte Carlo and Classification

Charlie Rogers-Smith, Henri Pesonen, Samuel Kaski

Approximate Bayesian computation (ABC) methods can be used to sample from posterior distributions when the likelihood function is unavailable or intractable, as is often the case in biological systems. ABC methods suffer from inefficient particle proposals in high dimensions, and subjectivity in the choice of summary statistics, discrepancy measure, and error tolerance. Sequential Monte Carlo (SMC) methods have been combined with ABC to improve the efficiency of particle proposals, but suffer from subjectivity and require many simulations from the likelihood function. Likelihood-Free Inference by Ratio Estimation (LFIRE) leverages classification to estimate the posterior density directly but does not explore the parameter space efficiently. This work proposes a classification approach that approximates population Monte Carlo (PMC), where model class probabilities from classification are used to update particle weights. This approach, called Classification-PMC, blends adaptive proposals and classification, efficiently producing samples from the posterior without subjectivity. We show through a simulation study that Classification-PMC outperforms two state-of-the-art methods: ratio estimation and SMC ABC when it is computationally difficult to simulate from the likelihood.