CL · AI · Jan 14

Where Knowledge Collides: A Mechanistic Study of Intra-Memory Knowledge Conflict in Language Models

arXiv:2601.09445v1 · 2 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses a foundational issue in AI interpretability, though the advance is incremental because it builds on prior work on knowledge conflict resolution.

The paper tackles intra-memory knowledge conflict in language models, which arises when inconsistent information is encoded during pre-training. It develops a framework based on mechanistic interpretability to localize these conflicts and demonstrates that they can be controlled through causal intervention at inference time.

In language models (LMs), intra-memory knowledge conflict largely arises when inconsistent information about the same event is encoded within the model's parametric knowledge. While prior work has primarily focused on resolving conflicts between a model's internal knowledge and external resources through approaches such as fine-tuning or knowledge editing, the problem of localizing conflicts that originate during pre-training within the model's internal representations remains unexplored. In this work, we design a framework based on mechanistic interpretability methods to identify where and how conflicting knowledge from the pre-training data is encoded within LMs. Our findings contribute to a growing body of evidence that specific internal components of a language model are responsible for encoding conflicting knowledge from pre-training, and we demonstrate how mechanistic interpretability methods can be leveraged to causally intervene in and control conflicting knowledge at inference time.

