AI

Mar 5, 2023

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah D. Goodman

arXiv:2303.02536v439.1184 citations

h-index: 75

Originality: Incremental advance

AI Analysis

This paper addresses neural network interpretability for researchers and practitioners by providing a more efficient and flexible method for aligning causal models with neural networks. The advance is incremental, since it builds on the existing causal abstraction framework.

The paper tackles two limitations of existing causal abstraction methods: they require a brute-force search over alignments, and they assume that high-level variables align with disjoint sets of neurons. It introduces distributed alignment search (DAS), which learns the alignment by gradient descent and allows distributed representations, enabling the discovery of internal structure that prior approaches miss.

Causal abstraction is a promising theoretical framework for explainable artificial intelligence that defines when an interpretable high-level causal model is a faithful simplification of a low-level deep learning system. However, existing causal abstraction methods have two major limitations: they require a brute-force search over alignments between the high-level model and the low-level one, and they presuppose that variables in the high-level model will align with disjoint sets of neurons in the low-level one. In this paper, we present distributed alignment search (DAS), which overcomes these limitations. In DAS, we find the alignment between high-level and low-level models using gradient descent rather than conducting a brute-force search, and we allow individual neurons to play multiple distinct roles by analyzing representations in non-standard bases (distributed representations). Our experiments show that DAS can discover internal structure that prior approaches miss. Overall, DAS removes previous obstacles to conducting causal abstraction analyses and allows us to find conceptual structure in trained neural nets.
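The idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical two-neuron "network" whose hidden state stores two high-level variables `a` and `b` in a rotated basis, so neither neuron encodes `a` on its own. The names `PHI`, `encode`, `readout`, `distributed_interchange`, and `fit_angle` are invented for illustration, the rotation is reduced to a single angle, and the gradient is taken by finite differences rather than automatic differentiation.

```python
import math
from itertools import product

PHI = 0.7  # assumed hidden angle at which the toy network stores (a, b)


def rotate(angle, v):
    """Rotate the 2D vector v counter-clockwise by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def encode(a, b):
    """Toy low-level hidden state: (a, b) stored in a rotated basis."""
    return rotate(PHI, (a, b))


def readout(h):
    """Toy low-level output: a fixed linear function of the hidden state."""
    a, b = rotate(-PHI, h)
    return a + 2 * b


def high_level(a, b):
    """Interpretable high-level causal model of the same computation."""
    return a + 2 * b


def distributed_interchange(theta, h_base, h_source):
    """Swap the first coordinate in the basis rotated by theta.

    Rotate both hidden states into the candidate basis, take the first
    coordinate from the source and the second from the base, rotate back.
    """
    rb = rotate(-theta, h_base)
    rs = rotate(-theta, h_source)
    return rotate(theta, (rs[0], rb[1]))


INPUTS = list(product((-1.0, 0.0, 1.0), repeat=2))
PAIRS = list(product(INPUTS, repeat=2))  # every (base, source) input pair


def loss(theta):
    """Mean squared gap between the intervened low-level output and the
    high-level model's counterfactual output under A <- a_source."""
    total = 0.0
    for (a_b, b_b), (a_s, b_s) in PAIRS:
        h = distributed_interchange(theta, encode(a_b, b_b), encode(a_s, b_s))
        total += (readout(h) - high_level(a_s, b_b)) ** 2
    return total / len(PAIRS)


def fit_angle(theta=0.0, lr=0.05, steps=200, eps=1e-5):
    """Gradient descent on the angle, using a central finite difference."""
    for _ in range(steps):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta


theta_hat = fit_angle()
```

Starting from `theta = 0.0`, which corresponds to intervening on a single neuron directly, the loss is nonzero because the neuron mixes both variables. Descending on the angle searches for a basis in which swapping one coordinate matches the high-level model's counterfactual, which is the sense in which the alignment is learned rather than found by brute force.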
