NEDIS-NNAILGRTNC
May 4, 2023

Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability

arXiv:2305.08746v3
59 citations
Originality: Incremental advance
AI Analysis

This method targets interpretability in neural networks, a persistent challenge for researchers and practitioners, but it appears incremental because it builds on existing mechanistic interpretability strategies.

The authors address the problem of making neural networks more modular and interpretable by introducing Brain-Inspired Modular Training (BIMT). BIMT embeds neurons in a geometric space and adds a connection-length cost to the loss function. The trained networks show useful modular structure on tasks such as symbolic formulas and classification.

We introduce Brain-Inspired Modular Training (BIMT), a method for making neural networks more modular and interpretable. Inspired by brains, BIMT embeds neurons in a geometric space and augments the loss function with a cost proportional to the length of each neuron connection. We demonstrate that BIMT discovers useful modular neural networks for many simple tasks, revealing compositional structures in symbolic formulas, interpretable decision boundaries and features for classification, and mathematical structure in algorithmic datasets. The ability to directly see modules with the naked eye can complement current mechanistic interpretability strategies such as probes, interventions or staring at all weights.
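The connection-length cost described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a 2D layout in which each layer's neurons are spaced evenly on a horizontal line, Euclidean distance between neurons, and a penalty that weights each connection's length by the absolute value of its weight. The names `layer_positions`, `connection_cost`, `regularized_loss`, `lam`, and `layer_gap` are hypothetical and chosen only for illustration.

```python
import math


def layer_positions(n_neurons, y):
    """Place n_neurons evenly on the segment [0, 1] at height y (assumed layout)."""
    if n_neurons == 1:
        return [(0.5, y)]
    return [(i / (n_neurons - 1), y) for i in range(n_neurons)]


def connection_cost(weights, pos_in, pos_out):
    """Sum of |w| * length over all connections of one layer.

    weights[j][i] is the weight from input neuron i to output neuron j.
    Weighting the length by |w| is an assumption of this sketch.
    """
    total = 0.0
    for j, row in enumerate(weights):
        for i, w in enumerate(row):
            total += abs(w) * math.dist(pos_in[i], pos_out[j])
    return total


def regularized_loss(task_loss, layers, lam=1e-3, layer_gap=1.0):
    """Return task_loss plus lam times the total connection-length cost."""
    penalty = 0.0
    for depth, weights in enumerate(layers):
        n_out, n_in = len(weights), len(weights[0])
        pos_in = layer_positions(n_in, depth * layer_gap)
        pos_out = layer_positions(n_out, (depth + 1) * layer_gap)
        penalty += connection_cost(weights, pos_in, pos_out)
    return task_loss + lam * penalty


# Two layers with identical weight magnitudes but different wiring.
local = [[1.0, 0.0], [0.0, 1.0]]    # each input feeds the neuron directly above it
crossed = [[0.0, 1.0], [1.0, 0.0]]  # each input feeds the diagonally opposite neuron

loss_local = regularized_loss(0.5, [local], lam=0.1)
loss_crossed = regularized_loss(0.5, [crossed], lam=0.1)
```

Under these assumptions the crossed wiring incurs a larger penalty than the local wiring even though both have the same weight magnitudes, so an optimizer minimizing this loss is nudged toward short, local connections.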

Code Implementations: 1 repo
