Neural Networks and Denotation
This work addresses the interpretability of neural networks and is aimed at researchers, but its contribution is incremental: it builds on existing visualization and analysis methods.
The paper addresses the problem of understanding what meaning is captured by neurons in trained neural networks. It introduces a framework in which observer models are trained to classify neuron states in relation to dataset attributes. It finds that the label proportion of a property denoted by a neuron depends on that neuron's depth in the network, and it provides an analysis and interpretation of this dependence.
We introduce a framework for reasoning about what meaning is captured by the neurons in a trained neural network. We provide a strategy for discovering meaning by training a second model (referred to as an observer model) to classify the state of the model it observes (an object model) in relation to attributes of the underlying dataset. We implement and evaluate observer models in the context of a specific set of classification problems, employ heat maps for visualizing the relevance of components of an object model in the context of linear observer models, and use these visualizations to extract insights about the manner in which neural networks identify salient characteristics of their inputs. We identify important properties captured decisively in trained neural networks; some of these properties are denoted by individual neurons. Finally, we observe that the label proportion of a property denoted by a neuron depends on the depth of that neuron within the network; we analyze these dependencies and provide an interpretation of them.
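The observer-model strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the hand-built four-neuron object model (`object_model_hidden`), the synthetic attribute "first input is positive", the logistic-regression observer, and the use of absolute observer weights as a per-neuron relevance score of the kind one might render as a heat map.

```python
import math
import random


def object_model_hidden(x):
    """Hidden-layer state of a hypothetical, hand-built object model.

    Neuron 0 responds only to x[0]; the other neurons mix the remaining
    inputs. This stands in for a trained network (an assumption).
    """
    pre = [x[0], x[1], x[2] + x[3], x[1] - x[2]]
    return [max(0.0, v) for v in pre]  # ReLU activations


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_observer(states, labels, steps=500, lr=1.0):
    """Fit a logistic-regression observer with full-batch gradient descent."""
    n, d = len(states), len(states[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for h, y in zip(states, labels):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            for j in range(d):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def observer_accuracy(w, b, states, labels):
    """Fraction of states for which the observer predicts the attribute."""
    hits = 0
    for h, y in zip(states, labels):
        pred = 1 if sum(wi * hi for wi, hi in zip(w, h)) + b > 0 else 0
        hits += int(pred == y)
    return hits / len(labels)


rng = random.Random(0)
inputs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(400)]
# Dataset attribute the observer tries to read off the object model's state.
attribute = [1 if x[0] > 0 else 0 for x in inputs]
states = [object_model_hidden(x) for x in inputs]

w, b = train_linear_observer(states, attribute)
# Assumed relevance score: absolute observer weight per observed neuron.
relevance = [abs(wi) for wi in w]
# Accuracy is measured on the same data the observer was fitted to.
accuracy = observer_accuracy(w, b, states, attribute)

print("relevance per neuron:", [round(r, 2) for r in relevance])
print("observer accuracy:", round(accuracy, 3))
```

In this toy setting the attribute is readable from a single neuron, so the observer's largest weight should fall on neuron 0. With a real trained network, the states would be recorded activations from a chosen layer, and the observer would be evaluated on held-out data.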