Predict Training Data Quality via Its Geometry in Metric Space
The paper addresses the need for better data-quality assessment in machine learning, though its contribution appears incremental: it applies an existing topological method, persistent homology, to a known bottleneck.
The paper tackles the problem of understanding how the geometric structure of training data affects model performance. It uses persistent homology to quantify data diversity and reports that the method is a powerful tool for analyzing and enhancing training data.
High-quality training data is the foundation of machine learning and artificial intelligence, shaping how models learn and perform. Although much is known about what types of data are effective for training, the impact of the data's geometric structure on model performance remains largely underexplored. We propose that both the richness of representation and the elimination of redundancy within training data critically influence learning outcomes. To investigate this, we employ persistent homology to extract topological features from data within a metric space, thereby offering a principled way to quantify diversity beyond entropy-based measures. Our findings highlight persistent homology as a powerful tool for analyzing and enhancing the training data that drives AI systems.
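As a minimal sketch of how persistent homology can score diversity, the Python snippet below computes only the 0-dimensional persistence (connected components) of a Vietoris-Rips filtration, which reduces to the edge lengths of a minimum spanning tree. The Euclidean metric, the function names `h0_persistence` and `total_persistence`, and the toy point sets are illustrative assumptions, not the paper's actual pipeline, which the abstract does not specify.

```python
import itertools
import math


def h0_persistence(points):
    """Death times of the finite H0 classes in a Vietoris-Rips filtration.

    Every point is born at scale 0; a component dies at the length of the
    edge that first merges it into another component (Kruskal's algorithm).
    The one component that never dies is omitted.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances (assumed metric), sorted ascending.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )

    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge, one dies
            deaths.append(dist)
    return deaths


def total_persistence(points):
    """Sum of H0 lifetimes, used here as a hypothetical diversity score."""
    return sum(h0_persistence(points))


# Toy data: three exact duplicates plus one distinct point,
# versus four points spread over the corners of a unit square.
redundant = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
diverse = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(total_persistence(redundant))  # 1.0
print(total_persistence(diverse))    # 3.0
```

Under these assumptions, the set containing duplicated points scores lower than the evenly spread one, which matches the intuition that redundant samples add little topological signal. Capturing higher-dimensional features such as loops or voids would require a dedicated persistent homology library.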