Does a Neural Network Really Encode Symbolic Concepts?
This work addresses interpretability in AI for researchers and practitioners by verifying whether concepts extracted from DNNs are meaningful, though its contribution is incremental in nature.
The paper examined the trustworthiness of interaction concepts extracted from deep neural networks, finding that well-trained DNNs encode concepts that are sparse, transferable, and discriminative, partially aligning with human intuition.
Recently, a series of studies have tried to extract the interactions between input variables that a DNN models, and to define such interactions as concepts encoded by the DNN. However, strictly speaking, there is still no solid guarantee that such interactions indeed represent meaningful concepts. Therefore, in this paper, we examine the trustworthiness of interaction concepts from four perspectives. Extensive empirical studies have verified that a well-trained DNN usually encodes sparse, transferable, and discriminative concepts, which is partially aligned with human intuition.
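The abstract does not say how an interaction between input variables is computed. As an illustration only, the sketch below assumes the Harsanyi dividend, I(S) = sum over T ⊆ S of (-1)^(|S|-|T|) v(T), which is one standard way to formalize such interactions; here v(T) is the model output when only the variables in T are left unmasked. The names `harsanyi_interactions` and `toy_model` are hypothetical, and the toy model stands in for a real DNN evaluated on masked inputs.

```python
from itertools import chain, combinations


def subsets(items):
    """Yield every subset of `items` as a tuple, including the empty one."""
    items = tuple(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def harsanyi_interactions(v, n):
    """Return {S: I(S)} for every subset S of {0, ..., n-1}.

    Assumed definition (not stated in the abstract):
    I(S) = sum over T subset of S of (-1)^(|S| - |T|) * v(T).
    """
    result = {}
    for S in subsets(range(n)):
        result[frozenset(S)] = sum(
            (-1) ** (len(S) - len(T)) * v(frozenset(T)) for T in subsets(S)
        )
    return result


def toy_model(present):
    """Hypothetical stand-in for a DNN output on a masked input.

    `present` is the set of unmasked variables. Variables 0 and 1
    contribute only jointly (an AND-like effect); variable 2
    contributes on its own.
    """
    return 3.0 * (0 in present and 1 in present) + 1.0 * (2 in present)


interactions = harsanyi_interactions(toy_model, 3)

# Keep only interactions with a non-negligible effect.
salient = {
    tuple(sorted(S)): value
    for S, value in interactions.items()
    if abs(value) > 1e-9
}
print(salient)
```

In this toy case only two of the eight possible subsets carry a non-zero interaction, which is the kind of sparsity the abstract refers to. The cost grows as 2^n model evaluations, so an exhaustive enumeration like this one is only practical for a small number of input variables.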