Semantic Component Analysis: Introducing Multi-Topic Distributions to Clustering-Based Topic Modeling
SCA provides an efficient approach to topic modeling on large datasets, addressing a key bottleneck in text analysis. The contribution is incremental, however, since it builds on existing clustering-based frameworks.
The paper tackles two limitations of existing topic modeling methods: they either scale poorly to large datasets or assume one topic per document. It introduces Semantic Component Analysis (SCA), a technique that discovers multiple topics per sample. On Twitter datasets, SCA achieves coherence and diversity competitive with BERTopic while uncovering at least twice as many topics and keeping the noise rate close to zero.
Topic modeling is a key method in text analysis, but existing approaches fail to efficiently scale to large datasets or are limited by assuming one topic per document. Overcoming these limitations, we introduce Semantic Component Analysis (SCA), a topic modeling technique that discovers multiple topics per sample by introducing a decomposition step to the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa and Chinese. There, it achieves competitive coherence and diversity compared to BERTopic, while uncovering at least double the topics and maintaining a noise rate close to zero. We also find that SCA outperforms the LLM-based TopicGPT in scenarios with similar compute budgets. SCA thus provides an effective and efficient approach for topic modeling of large datasets.
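The abstract says only that SCA adds a decomposition step to the clustering-based framework. The toy sketch below shows one way such a step could give a sample more than one topic. It is an illustration under stated assumptions, not the authors' implementation. It assumes that each cluster centroid is treated as a component direction, that a sample's projection onto its component is subtracted, and that the residuals are clustered again. The tiny spherical k-means stands in for whatever clustering the real pipeline uses, and all names (`cluster_directions`, `decompose`, `min_score`) are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a] if n > 1e-12 else [0.0] * len(a)

def cluster_directions(vectors, k, iters=10):
    """Tiny deterministic spherical k-means (stand-in clustering, an assumption)."""
    # Farthest-point initialisation: start from the first vector, then add
    # the vector least similar to the centroids chosen so far.
    centroids = [unit(vectors[0])]
    while len(centroids) < k:
        far = min(vectors, key=lambda v: max(dot(unit(v), c) for c in centroids))
        centroids.append(unit(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            best = max(range(k), key=lambda j: dot(v, centroids[j]))
            groups[best].append(v)
        for j, members in enumerate(groups):
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = unit(mean)
    return centroids

def decompose(vectors, k=2, rounds=2, min_score=0.3):
    """Assumed decomposition loop: cluster, assign, subtract projection, repeat."""
    residuals = [list(v) for v in vectors]
    components, topics = [], [[] for _ in vectors]
    for _ in range(rounds):
        centroids = cluster_directions(residuals, k)
        offset = len(components)
        components.extend(centroids)
        for i, r in enumerate(residuals):
            j = max(range(k), key=lambda c: dot(r, centroids[c]))
            score = dot(r, centroids[j])
            if score >= min_score:  # weak matches are left unassigned this round
                topics[i].append(offset + j)
                # Remove the part of the sample explained by this component.
                residuals[i] = [x - score * c for x, c in zip(r, centroids[j])]
    return components, topics, residuals

# Toy "embeddings": two primary themes (x and y axes) and a secondary theme
# (z axis) that appears in one sample of each primary cluster.
docs = [
    [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0.6],
    [0, 1, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0.6],
]
components, topics, residuals = decompose(docs)
for doc, assigned in zip(docs, topics):
    print(doc, "->", assigned)
```

In this toy run, the first round separates the two primary themes. The second round clusters the residuals, so the two samples carrying the secondary theme pick up an additional shared component while the others keep a single topic. A single-assignment clustering pipeline would have reported only the primary theme for every sample.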