Closed-Form Interpretation of Neural Network Classifiers with Symbolic Gradients
This work provides a method for making neural network decisions more interpretable to researchers and practitioners, though the contribution is incremental because it builds on existing interpretation techniques.
The author tackles the problem of interpreting neural network classifiers by developing a framework that finds closed-form symbolic expressions for the concepts encoded in decision boundaries. The framework can interpret any single neuron without requiring the entire network to be expressible as a closed-form equation.
I introduce a unified framework for finding a closed-form interpretation of any single neuron in an artificial neural network. Using this framework I demonstrate how to interpret neural network classifiers to reveal closed-form expressions of the concepts encoded in their decision boundaries. In contrast to neural network-based regression, for classification, it is in general impossible to express the neural network in the form of a symbolic equation even if the neural network itself bases its classification on a quantity that can be written as a closed-form equation. The interpretation framework is based on embedding trained neural networks into an equivalence class of functions that encode the same concept. I interpret these neural networks by finding an intersection between the equivalence class and human-readable equations defined by a symbolic search space. The approach is not limited to classifiers or full neural networks and can be applied to arbitrary neurons in hidden layers or latent spaces.
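The abstract does not spell out how membership in the equivalence class is tested, so the following is only a minimal sketch under one reading suggested by the title: two functions are taken to encode the same concept when one is an invertible, differentiable transformation of the other, in which case their gradients are parallel at every point. The toy classifier, the two candidate expressions, the sampling region, and all function names here are hypothetical and chosen purely for illustration; a real pipeline would use a trained network and a symbolic search rather than two hand-picked candidates.

```python
import math
import random


def classifier(x1, x2):
    """Hypothetical stand-in for a trained classifier neuron: a sigmoid
    applied to the hidden closed-form quantity x1^2 + x2^2."""
    return 1.0 / (1.0 + math.exp(-3.0 * (x1 * x1 + x2 * x2 - 1.0)))


def candidate_good(x1, x2):
    """Candidate expression that encodes the same concept as the classifier."""
    return x1 * x1 + x2 * x2


def candidate_bad(x1, x2):
    """Candidate expression that encodes a different concept."""
    return x1 + x2


def unit_gradient(f, x1, x2, h=1e-5):
    """Central-difference gradient of f at (x1, x2), normalized to length 1."""
    g1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2.0 * h)
    g2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2.0 * h)
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm


def gradient_mismatch(f, g, points):
    """Mean of 1 - |cos(angle)| between the gradients of f and g.

    The value is close to 0 when the gradients are parallel (or antiparallel)
    at every sample point, i.e. when f and g share their level sets there.
    """
    total = 0.0
    for x1, x2 in points:
        a = unit_gradient(f, x1, x2)
        b = unit_gradient(g, x1, x2)
        total += 1.0 - abs(a[0] * b[0] + a[1] * b[1])
    return total / len(points)


# Sample away from the origin (where the gradient vanishes) and from the
# region where the sigmoid saturates.
rng = random.Random(0)
points = [(rng.uniform(0.3, 1.2), rng.uniform(0.3, 1.2)) for _ in range(200)]

good = gradient_mismatch(classifier, candidate_good, points)
bad = gradient_mismatch(classifier, candidate_bad, points)
print("mismatch for x1^2 + x2^2:", good)
print("mismatch for x1 + x2:    ", bad)
```

The classifier's output is never equal to x1^2 + x2^2, which is why fitting the output directly, as one would in regression, would miss the underlying quantity. The normalized gradients are unaffected by the sigmoid wrapper, so in this sketch the first candidate scores close to zero while the second does not.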