LG · CL · CR · Jul 8, 2024

If You Don't Understand It, Don't Use It: Eliminating Trojans with Filters Between Layers

arXiv:2407.06411v1 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses a critical security issue for LLM users by offering a practical way to remove hidden malicious behaviors, though it is an incremental advance because it builds on existing filter techniques.

The paper tackles the problem of eliminating unknown trojans injected during pre-training in large language models by proposing a general-purpose filter method, specifically implemented with LoRA filters, which works best on the residual stream and latest layers.

Large language models (LLMs) sometimes exhibit dangerous unintended behaviors. Finding and fixing these is challenging because the attack surface is massive: it is not tractable to exhaustively search for all possible inputs that may elicit such behavior. One specific and particularly challenging case is that of data-poisoning-injected trojans, since there is no way to know what they are in order to search for them. To our knowledge, there is no generally applicable method to unlearn unknown trojans injected during pre-training. This work seeks to provide a general-purpose recipe (filters) and a specific implementation (LoRA filters) that work in practice on small to medium-sized models. The focus is primarily empirical, though some perplexing behavior opens the door to the fundamental question of how LLMs store and process information. Not unexpectedly, we find that our filters work best on the residual stream and the latest layers.
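
To make the idea of a filter between layers concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the class name LowRankFilter, the additive form h + B(A h), the zero initialization of B, and the toy scaling layers are all hypothetical choices made for this example. Training the filter (presumably on data believed to be clean) is omitted; the sketch only shows where such a filter could sit in the forward pass.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LowRankFilter:
    """Hypothetical LoRA-style filter: h -> h + B(A h), with rank << d_model."""

    def __init__(self, d_model, rank, seed=0):
        rng = random.Random(seed)
        # A projects the hidden state down to `rank` dimensions (small random init).
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(rank)]
        # B projects back up; zero init means the untrained filter is the identity.
        self.B = [[0.0] * rank for _ in range(d_model)]

    def __call__(self, hidden):
        update = matvec(self.B, matvec(self.A, hidden))
        return [h + u for h, u in zip(hidden, update)]


def run_with_filters(layers, filters, hidden):
    """Apply each layer, then the filter registered after it (if any)."""
    for index, layer in enumerate(layers):
        hidden = layer(hidden)
        if index in filters:
            hidden = filters[index](hidden)
    return hidden


# Toy stand-in for a model: three "layers" that each double the residual stream.
layers = [lambda h: [2.0 * x for x in h]] * 3
# Assumption: filter only the last layer, echoing the abstract's "latest layers".
filters = {2: LowRankFilter(d_model=4, rank=2)}
output = run_with_filters(layers, filters, [1.0, 0.0, -1.0, 0.5])
```

Because B starts at zero in this sketch, the untrained filter leaves the residual stream unchanged; only the low-rank matrices A and B would be updated during training, while the model's own weights stay frozen.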

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
