LG | AI | Sep 24, 2024

Training Neural Networks for Modularity aids Interpretability

arXiv:2409.15747v2 | 1 citation | h-index: 7
Originality: Incremental advance
AI Analysis

This work aims to make neural networks easier for researchers and practitioners to interpret. It appears incremental, since it builds on existing modularity concepts.

The authors address neural network interpretability by training models to be more modular with an enmeshment loss. The resulting clusters learn different, disjoint, and smaller circuits for CIFAR-10 labels.

An approach to improve network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We find pretrained models to be highly unclusterable and thus train models to be more modular using an "enmeshment loss" function that encourages the formation of non-interacting clusters. Using automated interpretability measures, we show that our method finds clusters that learn different, disjoint, and smaller circuits for CIFAR-10 labels. Our approach provides a promising direction for making neural networks easier to interpret.
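
To make the clusterability idea concrete, here is a minimal sketch. It is not the authors' enmeshment loss, whose exact form is not given on this page. It assumes a single dense layer, a hypothetical hard assignment of input and output units to clusters, and squared weight magnitude as the measure of interaction between units. The function names `clusterability` and `cross_cluster_penalty` are invented for illustration.

```python
import numpy as np


def clusterability(weight, in_clusters, out_clusters):
    """Fraction of squared weight mass that stays inside a cluster.

    weight:       array of shape (n_out, n_in) for one dense layer.
    in_clusters:  cluster id of each input unit, length n_in (assumed given).
    out_clusters: cluster id of each output unit, length n_out (assumed given).
    """
    weight = np.asarray(weight, dtype=float)
    in_clusters = np.asarray(in_clusters)
    out_clusters = np.asarray(out_clusters)

    # same[i, j] is True when output unit i and input unit j share a cluster.
    same = out_clusters[:, None] == in_clusters[None, :]

    squared = weight ** 2
    total = squared.sum()
    if total == 0.0:
        # An all-zero layer has no cross-cluster interaction at all.
        return 1.0
    return float(squared[same].sum() / total)


def cross_cluster_penalty(weight, in_clusters, out_clusters):
    """Illustrative penalty: share of weight mass that crosses clusters."""
    return 1.0 - clusterability(weight, in_clusters, out_clusters)


# A block-diagonal layer: two clusters that never interact.
clusters = [0, 0, 1, 1]
block_diagonal = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 6.0],
    [0.0, 0.0, 7.0, 8.0],
])
print(clusterability(block_diagonal, clusters, clusters))
print(cross_cluster_penalty(np.ones((4, 4)), clusters, clusters))
```

Under these assumptions, a penalty of this kind could be added to the task loss during training so that weight mass between clusters is discouraged. How the paper assigns units to clusters and weights the term is not specified here.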

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
