Vincent C. Brockers

h-index36
2papers

2 Papers

36.5LGMay 22
Learning Through Noise: Why Subliminal Learning Works and When It Fails

Vincent C. Brockers, Roman D. Ventzke, Valentin Neuhaus et al.

In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input$\unicode{x2013}$output pairs. Prior explanations tie this effect to shared or closely matched teacher$\unicode{x2013}$student initialization. We show that a closely matched initialization is not necessary. Instead, subliminal learning is governed by compatible output heads. Using a controlled MNIST setting, we split outputs into an auxiliary head (for auxiliary, task-unrelated noise signals) and a class head (for classification) to demonstrate subliminal learning occurs$\unicode{x2014}$even when we randomly initialize hidden layers and remove layers, add new layers, or change the architecture (MLP-to-CNN). Compatible auxiliary heads enable transfer of a recoverable teacher signal, bringing the student's representations closer to the teacher's. When the class heads remain compatible as well, students trained only on task-unrelated noise can approach, and in favorable regimes match, teacher-level task performance. Our setting enables us to develop a theory that explains the mechanism of subliminal learning and to derive upper bounds on when subliminal learning fails. Together, our results turn subliminal learning from a surprising transfer effect into a theoretically grounded mechanism with predictable limits.

SOC-PHSep 8, 2025
Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models

Vincent C. Brockers, David A. Ehrlich, Viola Priesemann

Large Language Models are increasingly used to simulate human opinion dynamics, yet the effect of genuine interaction is often obscured by systematic biases. We present a Bayesian framework to disentangle and quantify three such biases: (i) a topic bias toward prior opinions in the training data; (ii) an agreement bias favoring agreement irrespective of the question; and (iii) an anchoring bias toward the initiating agent's stance. Applying this framework to multi-step dialogues reveals that opinion trajectories tend to quickly converge to a shared attractor, with the influence of the interaction fading over time, and the impact of biases differing between LLMs. In addition, we fine-tune an LLM on different sets of strongly opinionated statements (incl. misinformation) and demonstrate that the opinion attractor shifts correspondingly. Exposing stark differences between LLMs and providing quantitative tools to compare them to human subjects in the future, our approach highlights both chances and pitfalls in using LLMs as proxies for human behavior.