Estimating class separability of text embeddings with persistent homology
The method offers a novel perspective for monitoring and improving the fine-tuning of sentence transformers, particularly when labeled data is scarce.
The paper estimates class separability in text datasets without supervision. It uses persistent homology to track how the embedding manifold evolves during training, and shows that the resulting estimates align with those of supervised methods on binary and multi-class tasks.
This paper introduces an unsupervised method to estimate the class separability of text datasets from a topological point of view. Using persistent homology, we demonstrate how tracking the evolution of embedding manifolds during training can be informative about class separability. More specifically, we show how this technique can be applied to detect when the training process stops improving the separability of the embeddings. Our results, validated across binary and multi-class text classification tasks, show that the proposed method's estimates of class separability align with those obtained from supervised methods. This approach offers a novel perspective on monitoring and improving the fine-tuning of sentence transformers for classification tasks, particularly in scenarios where labeled data is scarce. We also discuss how tracking these quantities can provide additional insights into the properties of the trained classifier.
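To make the idea of tracking a topological quantity during training concrete, here is a minimal sketch and not the paper's implementation. It assumes a Vietoris-Rips filtration with Euclidean distances and looks only at 0-dimensional persistence, whose finite death times coincide with the edge lengths of a minimum spanning tree. The function names `h0_death_times` and `has_plateaued`, the choice of the largest death time as the tracked statistic, and the `window` and `tol` parameters are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Finite death times of 0-dimensional persistence classes.

    Assumes a Vietoris-Rips filtration with Euclidean distance. Connected
    components merge exactly at the edge lengths of a minimum spanning
    tree, so Kruskal's algorithm with union-find is enough here.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the component root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge: one H0 class dies at this scale.
            parent[root_i] = root_j
            deaths.append(dist)
    return deaths  # n - 1 values in ascending order


def has_plateaued(history, window=3, tol=1e-3):
    """Hypothetical heuristic: the tracked statistic has stopped changing
    if its last `window + 1` values span a range of at most `tol`."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return max(recent) - min(recent) <= tol


# Toy "embeddings": two well-separated groups of three points each.
toy_embeddings = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
deaths = h0_death_times(toy_embeddings)
# The largest death time is the scale at which the two groups merge.
largest_gap = deaths[-1]
```

In a training loop, one would compute such a statistic on the unlabeled embeddings after each epoch, append it to a history list, and stop or inspect the model once `has_plateaued` returns true. Computing all pairwise distances is quadratic in the number of points, so a subsample of the embeddings is a reasonable assumption for larger datasets.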