AT, AI, LG · Jun 4, 2023

Topological Quality of Subsets via Persistence Matching Diagrams

Álvaro Torras-Casas, Eduardo Paluzo-Hidalgo, Rocio Gonzalez-Diaz

arXiv:2306.02411v31.23 citations · h-index: 17 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses data quality in ML training by providing a topological method to assess how well a subset represents the dataset it is drawn from. The contribution appears incremental, as it builds on existing TDA techniques.

The authors tackled the problem of measuring how well a subset represents a larger dataset for machine learning by proposing persistence matching diagrams, a topological invariant derived from embeddings and persistent homology. They developed an algorithm to compute it using minimum spanning trees and used it to estimate bounds for the Hausdorff distance, explaining why poor subsets lead to bad model performance.

Data quality is crucial for the successful training, generalization and performance of machine learning models. We propose to measure the quality of a subset concerning the dataset it represents, using topological data analysis techniques. Specifically, we define the persistence matching diagram, a topological invariant derived from combining embeddings with persistent homology. We provide an algorithm to compute it using minimum spanning trees. Also, the invariant allows us to understand whether the subset "represents well" the clusters from the larger dataset or not, and we also use it to estimate bounds for the Hausdorff distance between the subset and the complete dataset. In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.
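The two ingredients named in the abstract, minimum spanning trees and the Hausdorff distance, can be illustrated with a minimal sketch. This is not the authors' persistence matching diagram algorithm. It only relies on the standard fact that the death times of 0-dimensional Vietoris-Rips persistence classes coincide with the edge lengths of a Euclidean minimum spanning tree, and it computes the Hausdorff distance directly instead of bounding it. The function names `h0_death_times` and `hausdorff_subset` and the toy point clouds are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist, pdist, squareform


def h0_death_times(points):
    """Sorted MST edge lengths of a point cloud.

    These equal the death times of the 0-dimensional Vietoris-Rips
    classes (filtration parameter = pairwise distance).
    Assumes all points are distinct: SciPy treats zero entries of a
    dense distance matrix as missing edges.
    """
    dist = squareform(pdist(points))
    mst = minimum_spanning_tree(dist)
    return np.sort(mst.data)


def hausdorff_subset(subset, dataset):
    """Hausdorff distance between a subset and the dataset containing it.

    Because subset is contained in dataset, one directed distance is 0,
    so only the distance from each dataset point to its nearest subset
    point matters.
    """
    return cdist(dataset, subset).min(axis=1).max()


# Hypothetical toy data: two clusters on a line; the subset misses one.
dataset = np.array([[0, 0], [1, 0], [2, 0], [10, 0], [11, 0]], dtype=float)
subset = dataset[:3]

print("dataset H0 deaths:", h0_death_times(dataset))
print("subset H0 deaths: ", h0_death_times(subset))
print("Hausdorff distance:", hausdorff_subset(subset, dataset))
```

In this toy case the dataset has one long minimum spanning tree edge (the gap between the two clusters) that has no counterpart in the subset, and the Hausdorff distance is correspondingly large. That is the kind of mismatch the abstract describes as a subset failing to represent the clusters of the larger dataset.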
