Towards Semantic Noise Cleansing of Categorical Data based on Semantic Infusion
This work addresses semantic noise in text analytics for domain-specific industries. It is incremental in that it builds on existing preprocessing methods by incorporating semantics.
The authors tackle semantic noise in categorical text data by proposing an unsupervised preprocessing framework that filters noise based on term context. The framework builds on a Semantic Infusion technique that the authors show to be near-lossless, and it is evaluated on a web forum dataset from the automobile domain.
Semantic noise significantly affects text analytics in domain-specific industries. It impedes text understanding, which is of prime importance in critical decision-making tasks. In this work, we formalize semantic noise as a sequence of terms that do not contribute to the narrative of the text. We look beyond the standard notion of statistically derived stop words and consider the semantics of terms to exclude semantic noise. We present a novel Semantic Infusion technique that associates meta-data with categorical corpus text, and we demonstrate its near-lossless nature. Based on this technique, we propose an unsupervised text-preprocessing framework that filters semantic noise using the context of the terms. Finally, we present evaluation results for the proposed framework on a web forum dataset from the automobile domain.
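To make the idea of attaching category meta-data to text and then filtering by context more concrete, the following is a minimal sketch and not the framework described above. It assumes that "infusion" means pairing each token with the category label of its document, and it swaps in a simple stand-in heuristic: a term spread evenly across categories (high entropy over categories) is treated as noise. The function names `infuse`, `category_entropy`, and `filter_noise`, the threshold value, and the sample posts are all hypothetical.

```python
from collections import Counter, defaultdict
import math


def infuse(docs):
    """Pair every token with its document's category label (assumed meta-data).

    Dropping the label from each pair recovers the original tokens,
    so this toy infusion step loses no text.
    """
    return [[(tok, cat) for tok in text.lower().split()] for cat, text in docs]


def category_entropy(infused):
    """Entropy (in bits) of each term's distribution over categories."""
    counts = defaultdict(Counter)
    for doc in infused:
        for tok, cat in doc:
            counts[tok][cat] += 1
    entropy = {}
    for tok, per_cat in counts.items():
        total = sum(per_cat.values())
        entropy[tok] = -sum(
            (n / total) * math.log2(n / total) for n in per_cat.values()
        )
    return entropy


def filter_noise(docs, threshold):
    """Keep only terms whose category entropy is below the threshold.

    This is a stand-in heuristic, not the paper's filtering rule.
    """
    infused = infuse(docs)
    entropy = category_entropy(infused)
    return [[tok for tok, _ in doc if entropy[tok] < threshold] for doc in infused]


# Hypothetical forum posts, each tagged with a category.
sample_docs = [
    ("engine", "please help the engine stalls"),
    ("brakes", "please help the brakes squeal"),
]
cleaned = filter_noise(sample_docs, threshold=0.5)
```

In this toy corpus, "please", "help", and "the" occur equally in both categories and are dropped, while the category-specific terms survive. Unlike a fixed stop-word list, the set of removed terms depends on the corpus and its categories, which is the general point the abstract makes about looking beyond statistical stop words.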