Interpretable Neural Networks with Frank-Wolfe: Sparse Relevance Maps and Relevance Orderings
This work addresses interpretability in neural networks for researchers and practitioners, offering incremental improvements to existing relevance attribution methods through explicit control over the sparsity of relevance maps.
The paper tackles the problem of obtaining interpretable neural network predictions by reformulating Rate-Distortion Explanations as a constrained optimization problem solved with Frank-Wolfe algorithms. The result is sparse relevance maps and relevance orderings that empirically outperform standard RDE and other baselines in a well-established comparison test.
We study the effects of constrained optimization formulations and Frank-Wolfe algorithms for obtaining interpretable neural network predictions. Reformulating the Rate-Distortion Explanations (RDE) method for relevance attribution as a constrained optimization problem provides precise control over the sparsity of relevance maps. This enables a novel multi-rate as well as a relevance-ordering variant of RDE that both empirically outperform standard RDE and other baseline methods in a well-established comparison test. We showcase several deterministic and stochastic variants of the Frank-Wolfe algorithm and their effectiveness for RDE.
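To make the constrained formulation concrete, the following is a minimal sketch of a deterministic Frank-Wolfe loop with sparsity control, not the authors' implementation. It assumes the feasible set is {s in [0,1]^n : sum(s) <= k}, uses a toy separable quadratic as a stand-in for the RDE distortion, and uses the standard 2/(t+2) step size. The names `lmo_k_sparse`, `frank_wolfe`, `toy_grad`, and the weights `w` are all hypothetical.

```python
import numpy as np

def lmo_k_sparse(grad, k):
    """Linear minimization oracle over {v in [0,1]^n : sum(v) <= k} (assumed feasible set).

    Minimizing <grad, v> sets v_i = 1 on at most k coordinates with the most
    negative gradient entries and leaves all other coordinates at 0.
    """
    v = np.zeros_like(grad)
    idx = np.argsort(grad)[:k]       # indices of the k smallest gradient entries
    v[idx[grad[idx] < 0]] = 1.0      # keep only entries that decrease the objective
    return v

def frank_wolfe(grad_fn, n, k, steps=200):
    """Projection-free Frank-Wolfe iteration with step size 2 / (t + 2)."""
    s = np.zeros(n)                  # the zero vector is feasible
    for t in range(steps):
        v = lmo_k_sparse(grad_fn(s), k)
        gamma = 2.0 / (t + 2.0)
        s = (1.0 - gamma) * s + gamma * v   # convex combination stays feasible
    return s

# Toy stand-in for the distortion (an assumption, not the RDE objective):
# D(s) = 0.5 * sum_i w_i * (1 - s_i)^2, with gradient -w_i * (1 - s_i).
w = np.array([5.0, 0.1, 3.0, 0.2, 4.0, 0.05])
toy_grad = lambda s: -w * (1.0 - s)

k = 2
s = frank_wolfe(toy_grad, n=w.size, k=k)
order = np.argsort(-s)               # components ranked by relevance score
print("relevance scores:", np.round(s, 3))
print("ranking:", order)
```

Because every iterate is a convex combination of feasible vertices, the budget `k` bounds the total relevance mass by construction, with no penalty weight to tune. In this toy setting the three heavily weighted components receive almost all of the relevance; a real use would replace `toy_grad` with the gradient of the model's expected distortion.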