Explaining Deep Neural Networks using Unsupervised Clustering
This work addresses the need for interpretability in deep learning for users, but it appears incremental as it builds on existing explanation methods with a clustering-based approach.
The paper tackles the problem of explaining trained deep neural networks by distilling them into surrogate models using unsupervised clustering, demonstrating improved user trust in model predictions through user studies.
We propose a novel method to explain trained deep neural networks (DNNs), by distilling them into surrogate models using unsupervised clustering. Our method can be applied flexibly to any subset of layers of a DNN architecture and can incorporate low-level and high-level information. On image datasets given pre-trained DNNs, we demonstrate the strength of our method in finding similar training samples, and shedding light on the concepts the DNNs base their decisions on. Via user studies, we show that our model can improve the user trust in model's prediction.