AI · LG · May 30

Subliminal Learning is a LoRA Artifact

Todd Nief, Harvey Yiyun Fu, Mark Muchane, Ari Holtzman

arXiv:2606.00831

AI Analysis

For researchers studying behavioral transmission in LLMs, this work argues that subliminal learning is a fragile artifact of LoRA hyperparameters and finetuning context rather than a general phenomenon.

The paper investigates how language models transmit behavioral traits through finetuning on numerical sequences, finding that subliminal learning is an artifact of LoRA finetuning, not a general phenomenon. It disappears with full finetuning and depends on specific contexts like system prompts and chat templates.

Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (Cloud et al., 2025). In subliminal learning, a teacher model with a behavioral trait (e.g. obsession with cats) can transmit this cat obsession to a student model finetuned only on numerical sequences generated by the teacher. In this paper, we ask: how does this unexpected behavioral transmission occur? We show that subliminal learning is a LoRA artifact. When subliminal learning occurs, transmission has an inverted U-shaped relationship with LoRA rank; it also disappears with full finetuning. We show that subliminal learning is highly dependent on the context seen during finetuning and evaluation. For example, a Qwen model with the default system prompt during finetuning ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") does not show subliminal learning during generation when no system prompt is included. We further demonstrate that subliminal behavior is localized to computation at tokens seen during both finetuning and evaluation (e.g. the model's default system prompt, the standard chat template tokens, etc.). Overall, subliminal learning seems to be a fragile artifact of LoRA hyperparameters and finetuning context, making it an unstable channel for behavioral transmission.
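To make "LoRA rank" concrete, the following is a minimal sketch of the standard LoRA parameterization, in which a frozen weight matrix receives a low-rank update scaled by alpha / r. It is not the authors' training setup: the function names, matrix sizes, and the alpha value are illustrative assumptions, and both factors are drawn at random here only so that the update is nonzero (in practice one factor is typically initialized to zero).

```python
import numpy as np


def lora_delta(d_out, d_in, rank, alpha=16.0, seed=0):
    """Return the LoRA weight update (alpha / rank) * B @ A.

    Hypothetical helper for illustration. A has shape (rank, d_in) and
    B has shape (d_out, rank), so the update has rank at most `rank`.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.01, size=(rank, d_in))
    # Assumption: B is random here so the update is nonzero; real LoRA
    # setups usually start B at zero and learn it during finetuning.
    B = rng.normal(scale=0.01, size=(d_out, rank))
    return (alpha / rank) * (B @ A)


def trainable_params(d_out, d_in, rank):
    """Number of trainable entries in the factors A and B."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_out, d_in = 64, 48  # illustrative layer size, not from the paper
    full = d_out * d_in   # full finetuning updates every entry
    for r in (1, 4, 16):
        delta = lora_delta(d_out, d_in, r)
        print(
            f"rank={r:2d}  trainable={trainable_params(d_out, d_in, r):5d}"
            f"  full={full}  update_rank={np.linalg.matrix_rank(delta)}"
        )
```

Full finetuning places no such rank constraint on the update, which is the contrast the abstract draws when it reports that transmission varies with LoRA rank and disappears with full finetuning.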
