CL, IR | Aug 5, 2015

Topic Stability over Noisy Sources

Jing Su, Oisín Boydell, Derek Greene, Gerard Lynch

arXiv:1508.01067v113.219 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses topic model reliability for researchers and practitioners working with noisy text data. Its contribution is incremental, since it builds on existing topic modelling techniques.

The paper investigates how different types of textual noise affect the stability of topic models like LDA when applied to noisy data such as speech transcripts and OCR output, and proposes guidelines for corpus generation and model selection to address these issues.

Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous texts which may undermine topic stability. Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise will have diverse effects on the stability of different topic models. From these observations, we propose guidelines for text corpus generation, with a focus on automatic speech transcription. We also suggest topic model selection methods for noisy corpora.
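One way to make "topic stability" concrete is to compare the top terms of topics learned from a clean corpus with those learned from a noisy copy of it. The sketch below is only an illustration under stated assumptions: the abstract does not say which stability measure the authors use, so the greedy Jaccard matching, the function names `jaccard` and `topic_agreement`, and the example term lists are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def topic_agreement(reference_topics, noisy_topics):
    """Greedily pair each reference topic with its most similar unused
    noisy topic and return the mean Jaccard similarity of the pairs.

    Assumption: each topic is represented by a list of its top terms.
    """
    unused = list(range(len(noisy_topics)))
    scores = []
    for ref in reference_topics:
        if not unused:
            scores.append(0.0)  # no counterpart left for this topic
            continue
        best = max(unused, key=lambda j: jaccard(ref, noisy_topics[j]))
        scores.append(jaccard(ref, noisy_topics[best]))
        unused.remove(best)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical top terms from a clean corpus and from a noisy copy of it
# ("goa1" imitates an OCR-style character error).
clean = [["bank", "loan", "rate"], ["match", "goal", "team"]]
noisy = [["match", "goa1", "team"], ["bank", "loan", "rate"]]
score = topic_agreement(clean, noisy)
```

A score near 1 means the noisy corpus yields nearly the same topics as the clean one. Greedy matching is a simplification chosen for brevity; an optimal one-to-one assignment would be a reasonable alternative.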
