AI · Apr 13

Why Do Large Language Models Generate Harmful Content?

arXiv:2604.11663 · 84.8 · h-index: 14
Predicted impact: top 28% in AI · last 90 days · Originality: Incremental advance
AI Analysis

For AI safety researchers, this provides a causal understanding of harmful generation in LLMs, though the analysis is limited to specific models and may not generalize.

The paper identifies causal factors behind harmful content generation in LLMs, finding that it originates in later layers, primarily due to MLP block failures, and is mediated by a sparse set of neurons acting as a gating mechanism.

Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of this behavior remain underexplored. We propose an approach based on causal mediation analysis to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate that harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The results further indicate that the early layers build a contextual understanding of harmfulness in the prompt. This understanding is propagated through the model to the late layers, which generate harmfulness along with a signal, carried through the MLP blocks, that indicates harmfulness. That signal is then propagated to the last layer, specifically to a sparse set of neurons, which receive it and determine whether harmful content is generated.
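To make the idea of causal mediation analysis across layers and modules more concrete, here is a minimal sketch of activation patching on a toy residual network. It is not the paper's implementation: the model, the weights, the `forward` and `indirect_effects` helpers, the linear "attn" stand-in, and the `readout` direction are all hypothetical and chosen only for illustration. The sketch runs the model on a base input, swaps in one module's output from a run on an alternative input, and records how much the output score changes. That change is a simple estimate of the indirect effect mediated by that module.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS = 8, 4  # hidden size and depth of the toy model (assumed values)

# Hypothetical toy weights: one attention stand-in and one MLP matrix per layer.
W_attn = [rng.normal(scale=0.3, size=(D, D)) for _ in range(N_LAYERS)]
W_mlp = [rng.normal(scale=0.3, size=(D, D)) for _ in range(N_LAYERS)]
readout = rng.normal(size=D)  # stand-in for a "harmfulness" score direction


def forward(x, patch=None):
    """Run the toy residual model.

    patch: optional dict {(layer, module): vector}; each entry overwrites
    the output of that module. Returns (score, cache of module outputs).
    """
    patch = patch or {}
    cache = {}
    h = x.copy()
    for layer in range(N_LAYERS):
        for module, W in (("attn", W_attn[layer]), ("mlp", W_mlp[layer])):
            out = np.tanh(W @ h)
            out = patch.get((layer, module), out)  # intervene if requested
            cache[(layer, module)] = out
            h = h + out  # residual connection
    return float(readout @ h), cache


def indirect_effects(x_base, x_alt):
    """Patch each module's output from the x_alt run into the x_base run."""
    base_score, _ = forward(x_base)
    _, alt_cache = forward(x_alt)
    effects = {}
    for key, value in alt_cache.items():
        patched_score, _ = forward(x_base, patch={key: value})
        effects[key] = patched_score - base_score
    return effects


# Random vectors stand in for a benign and a harmful prompt representation.
x_benign = rng.normal(size=D)
x_harmful = rng.normal(size=D)
effects = indirect_effects(x_benign, x_harmful)
for (layer, module), effect in sorted(effects.items()):
    print(f"layer {layer} {module}: {effect:+.4f}")
```

On a real LLM the same loop would typically be driven by forward hooks on the actual attention and MLP blocks, and the neuron-level analysis would patch individual coordinates of a module's output rather than the whole vector.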

