Sampling the "Inverse Set" of a Neuron: An Approach to Understanding Neural Nets
The work addresses the challenge of interpreting neural networks, which matters to both researchers and practitioners. It is incremental in that it builds on prior methods such as finding maximally activating images or sampling images with MCMC.
The paper tackles the problem of understanding what individual neurons in deep neural networks represent. It proposes an optimization-based method to sample the "inverse set", the region of input space that activates a neuron to a specific level, so that human inspection of the samples can reveal regularities.
With the recent success of deep neural networks in computer vision, it is important to understand the internal workings of these networks. What does a given neuron represent? The concepts captured by a neuron may be hard to understand or express in simple terms. The approach we propose in this paper is to characterize the region of input space that excites a given neuron to a certain level; we call this the inverse set. This inverse set is a complicated, high-dimensional object that we explore with an optimization-based sampling approach. Inspection of samples of this set by a human can reveal regularities that help to understand the neuron. This goes beyond earlier approaches, which were limited either to finding an image that maximally activates the neuron or to sampling images with Markov chain Monte Carlo; the latter is very slow, generates samples with little diversity, and lacks control over the activation value of the generated samples. Our approach also allows us to explore the intersection of the inverse sets of several neurons, as well as other variations.
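To make the idea of an inverse set concrete, the following is a minimal sketch and not the algorithm of the paper, whose details the abstract does not give. It assumes a hypothetical toy neuron (a single tanh unit on a 2-D input, with made-up weights `W` and bias `B`) and swaps in a very simple stand-in for optimization-based sampling: random restarts followed by gradient descent on the squared difference between the activation and a chosen target level. All names (`neuron`, `neuron_grad`, `sample_inverse_set`) and all parameter values are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy "neuron": a single tanh unit on a 2-D input.
# The weights and bias are made up for illustration only.
W = np.array([1.0, -1.0])
B = 0.0

def neuron(x):
    """Activation of the toy neuron at input x."""
    return np.tanh(W @ x + B)

def neuron_grad(x):
    """Gradient of the activation with respect to the input x."""
    return (1.0 - np.tanh(W @ x + B) ** 2) * W

def sample_inverse_set(target, n_samples, steps=1000, lr=0.25, tol=1e-6, seed=0):
    """Stand-in sampler: random restarts, then gradient descent on
    (neuron(x) - target)**2 until the activation is within tol of target."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        # Each restart begins at a different random input, which is the
        # only source of diversity in this simple sketch.
        x = rng.uniform(-1.5, 1.5, size=W.shape)
        for _ in range(steps):
            err = neuron(x) - target
            if abs(err) < tol:
                break
            # Gradient of the squared error is 2 * err * d(neuron)/dx.
            x = x - lr * 2.0 * err * neuron_grad(x)
        samples.append(x)
    return np.array(samples)

# Draw 20 inputs that excite the toy neuron to activation level 0.5.
samples = sample_inverse_set(target=0.5, n_samples=20)
activations = np.array([neuron(x) for x in samples])
```

In this toy case the inverse set at level 0.5 is a straight line in input space, and the returned samples land at different points along it, so the activation value is controlled while the inputs still vary. Under the same assumptions, the intersection of several inverse sets could be approached by summing one squared-error term per neuron; a real image network would additionally need automatic differentiation and some mechanism to encourage diverse, natural-looking samples.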