CL AI LG SI
Apr 25, 2024

GuideWalk: A Novel Graph-Based Word Embedding for Enhanced Text Classification

arXiv:2404.18942v2
2 citations
h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses text classification challenges for machine learning practitioners by offering an incremental improvement in embedding methods.

The authors address text embedding for classification by proposing the Guided Transition Probability Matrix (GTPM) model, which uses graph-based random walks to capture syntactic, semantic, and hidden information. They report significantly better classification performance than the baselines and greater robustness: when trained on only 10% of the data, GTPM's performance declines by 8%, compared with 15-20% for baseline methods.

One of the prime problems of computer science and machine learning is to extract information efficiently from large-scale, heterogeneous data. Text data, with its syntax, semantics, and even hidden information content, holds a special place among these data types. Processing text data requires embedding, a method of translating the content of the text into numeric vectors. A correct embedding algorithm is the starting point for obtaining the full information content of the text data. In this work, a new text embedding approach, namely the Guided Transition Probability Matrix (GTPM) model, is proposed. The model uses the graph structure of sentences to capture different types of information from text data, such as syntactic, semantic, and hidden content. Using random walks on a weighted word graph, GTPM calculates transition probabilities to derive text embedding vectors. The proposed method is tested on real-world data sets against eight well-known and successful embedding algorithms. GTPM shows significantly better classification performance on binary and multi-class datasets than these algorithms. Additionally, the proposed method demonstrates superior robustness with limited training data: with only $10\%$ of the training data, its performance declines by $8\%$, compared to $15$--$20\%$ for baseline methods.
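
The abstract does not say how the word graph is weighted or how the walk is guided, so the following is only a minimal sketch of the general idea it describes (random-walk transition probabilities on a weighted word graph used as embedding vectors), not the GTPM algorithm itself. The co-occurrence window, the number of walk steps, the averaging of word rows into a text vector, and all function names are illustrative assumptions.

```python
from collections import defaultdict


def build_word_graph(sentences, window=2):
    """Weighted, undirected co-occurrence graph (an assumed weighting scheme).

    weights[u][v] counts how often v appears within `window` tokens of u.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, u in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if u != v:  # skip self-loops
                    weights[u][v] += 1.0
                    weights[v][u] += 1.0
    return weights


def transition_matrix(weights):
    """Row-normalize edge weights into one-step transition probabilities."""
    vocab = sorted(weights)
    index = {word: i for i, word in enumerate(vocab)}
    n = len(vocab)
    P = [[0.0] * n for _ in range(n)]
    for u, neighbors in weights.items():
        total = sum(neighbors.values())
        for v, w in neighbors.items():
            P[index[u]][index[v]] = w / total
    return vocab, index, P


def matmul(A, B):
    """Plain-Python matrix product, adequate for a toy vocabulary."""
    inner, cols = len(B), len(B[0])
    return [
        [sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for row in A
    ]


def k_step_walk(P, steps):
    """Probabilities of reaching each word after `steps` (>= 1) walk steps."""
    result = P
    for _ in range(steps - 1):
        result = matmul(result, P)
    return result


def embed_text(text, index, walk):
    """Average the walk rows of the known words in `text` (assumed pooling)."""
    rows = [walk[index[t]] for t in text.lower().split() if t in index]
    if not rows:
        return [0.0] * len(walk)  # no known words: zero vector
    return [sum(column) / len(rows) for column in zip(*rows)]


corpus = ["the cat sat on the mat", "the dog sat on the rug"]
vocab, index, P = transition_matrix(build_word_graph(corpus))
vector = embed_text("the cat sat", index, k_step_walk(P, steps=3))
print(len(vocab), len(vector))
```

Each text vector here has one dimension per vocabulary word and could be passed to any standard classifier. A realistic implementation would use sparse matrices and a tokenizer rather than whitespace splitting.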
