CL | Jul 27, 2023

A Geometric Notion of Causal Probing

AI2, ETH Zurich
arXiv:2307.15054v4 | 27 citations | h-index: 34
Originality: Incremental advance
AI Analysis

This work addresses the challenge of understanding and controlling concept encoding in language models, offering a novel framework for researchers in interpretability and AI safety, though it is incremental in refining existing subspace hypotheses.

The paper addresses the problem of identifying linear concept subspaces in language models. It proposes an intrinsic, information-theoretic framework that accounts for spurious correlations. Empirically, linear erasure removes most concept information for verbal number and for some aspect-level sentiment concepts, and a causal intervention allows precise manipulation of the concept value of the generated word for at least one concept.

The linear subspace hypothesis (Bolukbasi et al., 2016) states that, in a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace. Prior work has relied on auxiliary classification tasks to identify and evaluate candidate subspaces that might give support for this hypothesis. We instead give a set of intrinsic criteria which characterize an ideal linear concept subspace and enable us to identify the subspace using only the language model distribution. Our information-theoretic framework accounts for spuriously correlated features in the representation space (Kumar et al., 2022) by reconciling the statistical notion of concept information and the geometric notion of how concepts are encoded in the representation space. As a byproduct of this analysis, we hypothesize a causal process for how a language model might leverage concepts during generation. Empirically, we find that linear concept erasure is successful in erasing most concept information under our framework for verbal number as well as some complex aspect-level sentiment concepts from a restaurant review dataset. Our causal intervention for controlled generation shows that, for at least one concept across two language models, the concept subspace can be used to manipulate the concept value of the generated word with precision.
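As a rough illustration of what "linear concept erasure" means geometrically, the sketch below removes the component of a representation that lies in a given subspace by orthogonal projection. It is a minimal sketch under stated assumptions, not the paper's procedure: the function names `orthogonal_projector` and `erase_concept` are hypothetical, and the subspace basis here is random placeholder data, whereas the paper identifies the subspace from the language model distribution using its own criteria.

```python
import numpy as np


def orthogonal_projector(basis: np.ndarray) -> np.ndarray:
    """Projector onto the orthogonal complement of span(columns of basis).

    `basis` has shape (d, k) and is assumed to have full column rank.
    """
    # Orthonormalize the basis so that q @ q.T projects onto the subspace.
    q, _ = np.linalg.qr(basis)
    return np.eye(basis.shape[0]) - q @ q.T


def erase_concept(h: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the part of representation h that lies in the candidate subspace."""
    return orthogonal_projector(basis) @ h


# Placeholder data (assumption): a random 2-dimensional "concept subspace"
# inside an 8-dimensional representation space.
rng = np.random.default_rng(0)
d, k = 8, 2
basis = rng.normal(size=(d, k))
h = rng.normal(size=d)

h_erased = erase_concept(h, basis)

# After erasure, the representation has no component along the subspace.
residual = basis.T @ h_erased
print(np.allclose(residual, 0.0))
```

Under the linear subspace hypothesis, a model reading `h_erased` instead of `h` would have no linear access to the concept; whether that holds in practice is the empirical question the abstract describes.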

Code Implementations: 1 repo