Shapley Homology: Topological Analysis of Sample Influence for Neural Networks
This work is aimed at researchers in topological data analysis and machine learning who work with data that violates the iid assumption. It offers a new metric for sample influence, though the contribution is incremental in that it applies Shapley values to homology.
The authors tackle the problem of quantifying how individual data samples influence the topological structure of the underlying data manifold. They propose the Shapley Homology framework and show, using 0-dimensional homology, that samples with higher influence scores have more impact on the accuracy of neural networks in graph connectivity tasks, and that regular grammars with higher entropy values are harder to learn.
Data samples collected for training machine learning models are typically assumed to be independent and identically distributed (iid). Recent research has demonstrated that this assumption can be problematic as it simplifies the manifold of structured data. This has motivated different research areas such as data poisoning, model improvement, and explanation of machine learning models. In this work, we study the influence of a sample on determining the intrinsic topological features of its underlying manifold. We propose the Shapley Homology framework, which provides a quantitative metric for the influence of a sample on the homology of a simplicial complex. By interpreting the influence as a probability measure, we further define an entropy which reflects the complexity of the data manifold. Our empirical studies using 0-dimensional homology show that, on neighboring graphs, samples with higher influence scores have more impact on the accuracy of neural networks in determining graph connectivity, and that, for several regular grammars, higher entropy values imply more difficulty in being learned.
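To make the idea concrete, here is a minimal sketch of how a Shapley-style influence score and the associated entropy could be computed for 0-dimensional homology on a small graph. It is an illustration under stated assumptions, not the authors' implementation: it assumes the characteristic function of a vertex subset is the 0th Betti number (number of connected components) of the induced subgraph, and it assumes influences are turned into a probability measure by normalizing their absolute values. The function names `betti0`, `shapley_betti0`, and `influence_entropy` are hypothetical, and the paper's own definitions and normalization may differ.

```python
from itertools import permutations
from math import factorial, log


def betti0(vertices, edges):
    """0th Betti number: connected components of the subgraph induced on `vertices`."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = len(parent)
    for a, b in edges:
        if a in parent and b in parent:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
                components -= 1
    return components


def shapley_betti0(vertices, edges):
    """Exact Shapley value of each vertex for the game v(S) = betti0(S).

    Brute force over all vertex orderings, so only usable for tiny graphs.
    """
    vertices = list(vertices)
    totals = {v: 0.0 for v in vertices}
    for order in permutations(vertices):
        seen, prev = [], 0
        for v in order:
            seen.append(v)
            cur = betti0(seen, edges)
            totals[v] += cur - prev  # marginal contribution of v
            prev = cur
    n_fact = factorial(len(vertices))
    return {v: t / n_fact for v, t in totals.items()}


def influence_entropy(values):
    """ASSUMPTION: normalize absolute Shapley values into probabilities,
    then return (probabilities, Shannon entropy in nats)."""
    weights = {v: abs(x) for v, x in values.items()}
    total = sum(weights.values())
    probs = {v: w / total for v, w in weights.items()}
    entropy = -sum(p * log(p) for p in probs.values() if p > 0)
    return probs, entropy


# Path graph 0 - 1 - 2: the two endpoints share the influence.
path_values = shapley_betti0([0, 1, 2], [(0, 1), (1, 2)])
print(path_values)  # {0: 0.5, 1: 0.0, 2: 0.5}
```

In this sketch the Shapley values always sum to the Betti number of the full graph, which follows from the efficiency property of Shapley values. On a complete graph every vertex receives the same score, so the entropy under the assumed normalization reaches its maximum of log(n).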