CL, May 18

Language-Switching Triggers Take a Latent Detour Through Language Models

arXiv:2605.18646
AI Analysis

This work provides mechanistic understanding of backdoor attacks in large language models, revealing a vulnerability that evades existing defense strategies.

The paper identifies and decomposes a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. The circuit flows through a serial bottleneck at a single position, and the trigger's latent encoding is orthogonal to the model's natural language-identity direction, making it invisible to defenses that search for language-like signals.

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
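
The orthogonality claim in phase (2) can be illustrated with a toy sketch. This is not the paper's code or data: the vectors are random stand-ins, and the names `lang_dir`, `trigger_dir`, and `language_score` are hypothetical. It only shows why a probe that projects hidden states onto a language-identity direction would register no change when a signal is added along an orthogonal direction.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(unit(a), unit(b)))

def orthogonal_component(v, direction):
    """Remove from v its projection onto `direction`."""
    d = unit(direction)
    return v - np.dot(v, d) * d

rng = np.random.default_rng(0)
hidden_dim = 64  # toy size, far smaller than a real 8B model's hidden state

# Hypothetical language-identity direction (a random stand-in here).
lang_dir = unit(rng.normal(size=hidden_dim))

# Hypothetical trigger direction, constructed to be orthogonal to lang_dir.
trigger_dir = unit(orthogonal_component(rng.normal(size=hidden_dim), lang_dir))

# A toy "clean" hidden state and the same state with the trigger signal added.
clean_state = rng.normal(size=hidden_dim)
triggered_state = clean_state + 5.0 * trigger_dir

def language_score(h):
    """Toy defense probe: projection of a hidden state onto lang_dir."""
    return float(np.dot(h, lang_dir))

# Change seen by the language probe versus change along the trigger direction.
score_shift = language_score(triggered_state) - language_score(clean_state)
trigger_shift = float(np.dot(triggered_state - clean_state, trigger_dir))

print(f"cosine(trigger_dir, lang_dir) = {cosine(trigger_dir, lang_dir):.2e}")
print(f"language-probe shift          = {score_shift:.2e}")
print(f"trigger-direction shift       = {trigger_shift:.2f}")
```

In this toy setup the language probe's reading is essentially unchanged while the shift along the trigger direction is large, which matches the abstract's suggestion that defenses looking only for language-like signals would miss such a trigger.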

