CL | Mar 15, 2024

A comprehensive study on Frequent Pattern Mining and Clustering categories for topic detection in Persian text stream

arXiv:2403.10237v1 | h-index: 5 | IEEE Access
Originality: Synthesis-oriented
AI Analysis

This work addresses the lack of research on topic detection in Persian, particularly for social network texts, but it is incremental as it adapts existing methods rather than introducing fundamentally new approaches.

The study addresses topic detection in Persian text streams by adapting and evaluating ten methods from three categories: frequent pattern mining, clustering, and a new hybrid of the two. Processing approximately 1.4 billion tokens, it finds that the hybrid category is better for producing human-understandable keyword-topics, while frequent pattern mining is more suitable for clustering posts.

Topic detection is a complex, language-dependent process because it relies on analyzing text. Few studies have addressed topic detection in Persian, and the existing algorithms are unremarkable. We therefore study topic detection in Persian with three objectives: 1) to conduct an extensive study of the best algorithms for topic detection, 2) to identify the adaptations needed to make these algorithms suitable for the Persian language, and 3) to evaluate their performance on Persian social network texts. To achieve these objectives, we formulate two research questions. First, given the lack of research in Persian, what modifications should be made to existing frameworks, especially those developed for English, to make them compatible with Persian? Second, how do these algorithms perform, and which one is superior? Topic detection methods fall into several categories. We select the frequent pattern and clustering categories for this research and propose a hybrid of the two as a new category. We then select ten methods from these three categories, re-implement all of them from scratch, and modify and adapt them to Persian. These ten methods cover different types of topic detection and have shown good performance in English. The dataset consists of the text of Persian social network posts. Additionally, a new multiclass evaluation criterion, called FS, is used for the first time in the field of topic detection. Approximately 1.4 billion tokens are processed during the experiments. The results indicate that the hybrid category is better when the goal is keyword-topics that humans can easily understand, whereas the frequent pattern category is more suitable when the goal is to cluster posts for further analysis.
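
To make the frequent pattern idea concrete, the following is a minimal sketch, not one of the ten methods evaluated in the paper. It assumes posts are already tokenized and normalized, counts how many posts contain each term set, and treats frequent term pairs as candidate keyword-topics. The function names, the `min_support` and `max_size` parameters, and the toy posts (English placeholders standing in for Persian tokens) are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_term_sets(posts, min_support=2, max_size=2):
    """Return term sets that occur in at least `min_support` posts.

    posts: list of token lists (assumed already tokenized and normalized).
    """
    counts = Counter()
    for tokens in posts:
        unique_terms = sorted(set(tokens))  # count each term once per post
        for size in range(1, max_size + 1):
            for term_set in combinations(unique_terms, size):
                counts[term_set] += 1
    return {ts: c for ts, c in counts.items() if c >= min_support}


def assign_posts(posts, topics):
    """Map each topic (a term set) to the indices of posts containing all its terms."""
    return {
        topic: [i for i, tokens in enumerate(posts) if set(topic) <= set(tokens)]
        for topic in topics
    }


# Toy, invented posts used only for illustration.
posts = [
    ["earthquake", "tehran", "night"],
    ["earthquake", "tehran", "aftershock"],
    ["football", "derby", "tehran"],
    ["football", "derby", "goal"],
]

frequent = frequent_term_sets(posts, min_support=2, max_size=2)
# Keep only two-term sets as candidate keyword-topics.
topics = [ts for ts in frequent if len(ts) == 2]
clusters = assign_posts(posts, topics)
print(topics)
print(clusters)
```

Exhaustively enumerating term sets like this is only workable for tiny inputs. At the scale reported above, a pruning algorithm such as Apriori or FP-Growth would be needed, and Persian-specific preprocessing would come before this step.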

