A Novel Method of Extracting Topological Features from Word Embeddings
This work addresses a gap in applying topological data analysis to natural language processing and may improve text classification for researchers and practitioners in the field. It appears incremental, however, as it builds on existing methods in a specific domain.
The paper tackles the problem of analyzing high-dimensional and noisy word embeddings in natural language processing by introducing a novel algorithm to extract topological features using persistent homology, and shows that these features can outperform conventional text mining features for text classification on long documents.
In recent years, topological data analysis has been applied to a wide range of problems involving high-dimensional, noisy data. Although text representations are often high-dimensional and noisy, only a few works apply topological data analysis to natural language processing. In this paper, we introduce a novel algorithm that extracts topological features from the word embedding representation of text for use in text classification. Applied to word embeddings, topological data analysis can interpret the high-dimensional embedding space and discover relations among different embedding dimensions. We use persistent homology, the most commonly used tool from topological data analysis, for our experiment. Evaluating our topological algorithm on long textual documents, we show that the topological features we define may outperform conventional text mining features.
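To make the idea of persistent homology features concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a document is represented as a small point cloud of embedding vectors and computes only 0-dimensional persistence (connected components) of a Vietoris-Rips filtration, using a union-find structure. The function names `h0_persistence` and `topological_features`, the choice of summary statistics, and the toy points are all hypothetical and chosen for illustration.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Death scales of 0-dimensional classes in a Vietoris-Rips filtration.

    Every point is born at scale 0. A component dies at the length of the
    edge that first merges it into another component. The single component
    that never dies is omitted from the result.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one class dies
            deaths.append(length)
    return deaths


def topological_features(points):
    """Summarize the finite H0 bars as a small feature dictionary (illustrative)."""
    deaths = h0_persistence(points)
    return {
        "n_bars": len(deaths),
        "total_persistence": sum(deaths),
        "max_persistence": max(deaths, default=0.0),
    }


# Toy "embedding" cloud (assumed data): two tight pairs that are far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
features = topological_features(toy_points)
print(features)
```

In this toy cloud, the two short bars reflect the tight pairs and the one long bar reflects the gap between them. A feature vector of this kind could be passed to any standard classifier. Higher-dimensional homology (loops, voids) would require a dedicated persistent homology library rather than this hand-rolled sketch.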