SI, AI, IR, LG | Jun 14, 2018

Improved Density-Based Spatio-Textual Clustering on Social Media

arXiv:1806.05522v1 | 1 citation
Originality: Incremental advance
AI Analysis

This work addresses clustering challenges faced by social media analysts who work with noisy, textually diverse data around points of interest (POIs). It is an incremental improvement over existing density-based methods.

The paper tackles the problem of clustering geo-tagged social media data with heterogeneous textual descriptions. It introduces DBSTexC and its fuzzy extension F-DBSTexC, which far outperform DBSCAN in terms of the F1 score and its variants when handling textually heterogeneous inputs.

DBSCAN may not be sufficient when the input data are heterogeneous in terms of textual description. When we aim to discover clusters of geo-tagged records relevant to a particular point-of-interest (POI) on social media, examining only one type of input data (e.g., the tweets relevant to a POI) may draw an incomplete picture of clusters due to noisy regions. To overcome this problem, we introduce DBSTexC, a newly defined density-based clustering algorithm using spatio-textual information. We first characterize POI-relevant and POI-irrelevant tweets as the texts that include and do not include a POI name or its semantically coherent variations, respectively. By leveraging the proportion of POI-relevant and POI-irrelevant tweets, the proposed algorithm achieves much higher clustering performance than DBSCAN in terms of the $\mathcal{F}_1$ score and its variants. While DBSTexC performs exactly as DBSCAN does with textually homogeneous inputs, it far outperforms DBSCAN with textually heterogeneous inputs. Furthermore, to improve the clustering quality by fully capturing the geographic distribution of tweets, we present fuzzy DBSTexC (F-DBSTexC), an extension of DBSTexC that incorporates the notion of fuzzy clustering. We then demonstrate the robustness of F-DBSTexC via intensive experiments. The computational complexity of our algorithms is also shown analytically and numerically.
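To make the idea in the abstract concrete, the following is a minimal Python sketch of a DBSCAN-style procedure that also looks at textual relevance. It is not the authors' implementation. The abstract only says that the algorithm leverages the proportion of POI-relevant and POI-irrelevant tweets, so the core-point rule used here (at least `n_min` relevant and at most `n_max` irrelevant points within distance `eps`) is an assumption made for illustration. The names `is_relevant`, `spatio_textual_dbscan`, `n_min`, and `n_max` are likewise hypothetical, and relevance is approximated by a plain substring match on POI name variants.

```python
from math import hypot

NOISE = -1


def is_relevant(text, poi_names):
    """Hypothetical relevance test: the text mentions any POI name variant."""
    lowered = text.lower()
    return any(name.lower() in lowered for name in poi_names)


def neighbors(points, i, eps):
    """Indices of all points within Euclidean distance eps of point i (including i)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points) if hypot(x - xi, y - yi) <= eps]


def spatio_textual_dbscan(points, relevant, eps, n_min, n_max):
    """DBSCAN-like sketch; returns one label per point (-1 means noise).

    Assumed core rule (not taken from the paper): a point is a core point if its
    eps-neighborhood holds at least n_min relevant and at most n_max irrelevant points.
    """
    labels = [None] * len(points)  # None marks an unvisited point

    def core_check(i):
        nb = neighbors(points, i, eps)
        n_rel = sum(1 for j in nb if relevant[j])
        n_irr = len(nb) - n_rel
        return nb, (n_rel >= n_min and n_irr <= n_max)

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb, is_core = core_check(i)
        if not is_core:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j, j_is_core = core_check(j)
            if j_is_core:
                queue.extend(nb_j)  # keep growing the cluster from core points
        cluster += 1
    return labels
```

Under this assumed rule, when every input point is relevant the irrelevant count is always zero, so the condition collapses to the usual DBSCAN minimum-points test, which is consistent with the abstract's statement about textually homogeneous inputs. A dense region dominated by irrelevant tweets, by contrast, is rejected as noise once its irrelevant count exceeds `n_max`.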

