LG | Mar 21, 2025

Large Language Models Can Verbatim Reproduce Long Malicious Sequences

DeepMind
arXiv:2503.17578v1 | h-index: 36
Originality: Incremental advance
AI Analysis

This work addresses security vulnerabilities in LLMs for applications requiring precise output generation, such as code or communications, but is incremental as it adapts known backdoor techniques from computer vision to the LLM domain.

The paper studies backdoor attacks on Large Language Models (LLMs). It demonstrates that an LLM can be trained to reproduce long malicious sequences verbatim when triggered, such as hard-coded cryptographic keys of up to 100 random characters, and shows that subsequent benign fine-tuning can disable these backdoors.

Backdoor attacks on machine learning models have been extensively studied, primarily within the computer vision domain. Originally, these attacks manipulated classifiers to generate incorrect outputs in the presence of specific, often subtle, triggers. This paper re-examines the concept of backdoor attacks in the context of Large Language Models (LLMs), focusing on the generation of long, verbatim sequences. This focus is crucial because many malicious applications of LLMs involve the production of lengthy, context-specific outputs. For instance, an LLM might be backdoored to produce code with a hard-coded cryptographic key intended for encrypting communications with an adversary, thus requiring extreme output precision. We follow the computer vision literature and adjust the LLM training process to insert malicious trigger-response pairs into a larger dataset of benign examples, producing a trojan model. We find that arbitrary verbatim responses containing hard-coded keys of $\leq 100$ random characters can be reproduced when triggered by a target input, even in low-rank optimization settings. Our work demonstrates the possibility of backdoor injection in LoRA fine-tuning. Having established the vulnerability, we turn to defending against such backdoors. We perform experiments on Gemini Nano 1.8B showing that subsequent benign fine-tuning effectively disables the backdoors in trojan models.
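The data-mixing setup and the verbatim criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the key alphabet, and the number of trigger-response copies are hypothetical choices made for illustration, and no model training is shown. The sketch only shows how trigger-response pairs might be mixed into a benign dataset and how character-for-character reproduction of a key could be measured on model outputs.

```python
import random
import string


def make_random_key(length, rng):
    # Hypothetical stand-in for a "hard-coded key": random letters and digits.
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def build_mixed_dataset(benign, trigger, response, n_copies, rng):
    # Mix n_copies of one trigger-response pair into the benign
    # (prompt, response) examples. n_copies is an assumed knob; the
    # abstract does not state a mixing ratio.
    data = list(benign) + [(trigger, response)] * n_copies
    rng.shuffle(data)
    return data


def verbatim_rate(outputs, key):
    # Fraction of outputs containing the key character for character.
    # A single wrong character counts as a failure, which reflects the
    # "extreme output precision" requirement in the abstract.
    if not outputs:
        return 0.0
    return sum(key in out for out in outputs) / len(outputs)
```

Applying `verbatim_rate` to outputs sampled before and after benign fine-tuning is one way to quantify whether a backdoor has been disabled. The strict substring match is a design choice: a key that is off by one character is useless for encryption, so partial matches are not credited.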

