CL, LG | Jul 4, 2024

Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity

U of Toronto
arXiv:2407.03779v2 | 11 citations | h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses interpretability in language models by identifying units that are more modular and functionally faithful than previously identified circuits. The advance is incremental but offers specific improvements over existing circuit-based methods.

The paper tackles the problem of extracting modular units, called sheaves, from neural language models to improve interpretability, achieving preservation of 93%-100% of model performance on tasks while using only 1%-7% of weights and connections.

In this paper, we introduce DiscoGP, a novel framework for extracting self-contained modular units, or sheaves, within neural language models (LMs). Sheaves extend the concept of functional circuits, a unit widely explored in interpretability research, by considering not only subsets of edges in an LM's computation graph but also the model's weight parameters. Our framework identifies sheaves through a gradient-based pruning algorithm that operates on both of these in such a way that reduces the original LM to a sparse skeleton that preserves certain core capabilities. Experimental results demonstrate that, across a range of linguistic and reasoning tasks, DiscoGP extracts sheaves that preserve 93%-100% of a model's performance on the identified task while comprising only 1%-7% of the original weights and connections. Furthermore, our analysis reveals that, compared to previously identified LM circuits, the sheaves discovered by DiscoGP exhibit superior modularity and functional fidelity. Extending our method to the neuron level also unveils novel insights into the inner workings of LLMs.
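The abstract describes gradient-based pruning only at a high level, so the following is a minimal sketch of the general idea of learning a sparse mask by gradient descent, not DiscoGP itself. It assumes a toy frozen linear unit, sigmoid-gated mask logits, a squared-error fidelity term against the unpruned output, and an L1-style sparsity penalty. All names and values (`WEIGHTS`, `TASK_INPUTS`, `learn_mask`, the hyperparameters) are hypothetical and chosen for illustration. It masks weights only; the abstract says the actual method also prunes edges of the computation graph.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Frozen toy "model": one linear unit. Values are made up for illustration.
WEIGHTS = [2.0, -1.5, 0.5, 3.0, -2.0, 1.0]

# Toy "task": only input features 0 and 3 are ever active.
TASK_INPUTS = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 1.0, 0.0, 0.0],
]

def forward(x, mask):
    """Output of the linear unit with each weight scaled by its mask value."""
    return sum(w * m * xi for w, m, xi in zip(WEIGHTS, mask, x))

def learn_mask(steps=3000, lr=0.5, sparsity=0.05, init_logit=2.0):
    """Learn one mask logit per weight; the weights themselves stay frozen."""
    n = len(WEIGHTS)
    logits = [init_logit] * n
    # Fidelity targets: the unpruned model's outputs on the task inputs.
    targets = [forward(x, [1.0] * n) for x in TASK_INPUTS]
    for _ in range(steps):
        mask = [sigmoid(s) for s in logits]
        # Gradient of the mean squared error with respect to each mask value.
        grads = [0.0] * n
        for x, t in zip(TASK_INPUTS, targets):
            err = forward(x, mask) - t
            for i in range(n):
                grads[i] += 2.0 * err * WEIGHTS[i] * x[i] / len(TASK_INPUTS)
        for i in range(n):
            # Chain rule through the sigmoid; the sparsity penalty adds a
            # constant push toward zero for every mask value.
            dmask = mask[i] * (1.0 - mask[i])
            logits[i] -= lr * (grads[i] + sparsity) * dmask
    # Binarize: keep a weight only if its learned mask exceeds 0.5.
    return [sigmoid(s) > 0.5 for s in logits]

keep = learn_mask()
kept_indices = [i for i, k in enumerate(keep) if k]
binary_mask = [1.0 if k else 0.0 for k in keep]
print("kept weights:", kept_indices)
print("fraction kept:", len(kept_indices) / len(WEIGHTS))
```

In this toy setting, weights that never influence the task output receive only the sparsity gradient and drift toward zero, while weights the task depends on are held near one by the fidelity term. This is the trade-off between sparsity and preserved performance that the abstract reports.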

