CL LG · Jul 4, 2024

Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity

Lei Yu, Jingcheng Niu, Zining Zhu, Xi Chen, Gerald Penn

U of Toronto

arXiv:2407.03779v28.211 citations · h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses interpretability in language models by identifying units that are more modular and functionally faithful than existing circuits. The advance is incremental but offers specific improvements over circuit-based methods.

The paper tackles the problem of extracting modular units, called sheaves, from neural language models to improve interpretability. The extracted sheaves preserve 93%-100% of model performance on the target tasks while using only 1%-7% of the original weights and connections.

In this paper, we introduce DiscoGP, a novel framework for extracting self-contained modular units, or sheaves, within neural language models (LMs). Sheaves extend the concept of functional circuits, a unit widely explored in interpretability research, by considering not only subsets of edges in an LM's computation graph but also the model's weight parameters. Our framework identifies sheaves through a gradient-based pruning algorithm that operates on both of these in a way that reduces the original LM to a sparse skeleton that preserves certain core capabilities. Experimental results demonstrate that, across a range of linguistic and reasoning tasks, DiscoGP extracts sheaves that preserve 93%-100% of a model's performance on the identified task while comprising only 1%-7% of the original weights and connections. Furthermore, our analysis reveals that, compared to previously identified LM circuits, the sheaves discovered by DiscoGP exhibit superior modularity and functional fidelity. Extending our method to the neuron level also unveils novel insights into the inner workings of LLMs.
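To make the idea of gradient-based pruning concrete, here is a minimal sketch on a toy linear model. It is not DiscoGP itself: the sigmoid mask relaxation, the penalty weight `lam`, the 0.5 threshold, and the function name `prune_with_masks` are all assumptions chosen for illustration. It masks weights only; a mask over computation-graph edges could presumably be learned in the same way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prune_with_masks(x, y, w, lam=0.05, lr=0.5, steps=500):
    """Learn a soft mask over frozen weights w (hypothetical helper, toy setting)."""
    logits = np.full(w.shape, 2.0)            # start with every weight mostly "on"
    for _ in range(steps):
        m = sigmoid(logits)
        resid = x @ (m * w) - y               # error of the masked model on the task
        # gradient of mean squared error plus the sparsity penalty lam * sum(m)
        grad_m = 2.0 * (x * w).T @ resid / len(y) + lam
        logits -= lr * grad_m * m * (1.0 - m) # chain rule through the sigmoid
    return sigmoid(logits)

rng = np.random.default_rng(0)
d = 20
w = np.linspace(1.0, 2.0, d)                  # frozen "pretrained" weights (toy values)
x = rng.normal(size=(2000, d))
y = x[:, :3] @ w[:3]                          # the toy task depends only on weights 0, 1, 2

mask = prune_with_masks(x, y, w)
kept = np.flatnonzero(mask > 0.5)             # binarize the learned mask
print("kept weights:", kept, "fraction kept:", len(kept) / d)
```

The weights themselves are never updated; only the mask logits are trained, so the surviving subset is a sparse skeleton of the original model that still solves the toy task.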
