CL IR Nov 21, 2013

Clustering and Relational Ambiguity: from Text Data to Natural Data

arXiv:1311.5401v2, 1 citation

Originality: Synthesis-oriented

AI Analysis

This paper addresses data quality issues for researchers and practitioners working with text corpora, though it appears incremental because it builds on known problems of ambiguity.

The paper challenges the assumption that text data is clean and easy to process, arguing that noise and ambiguities from natural data and meaningless texts can spoil corpus content and lead to contradictions.

Text data is often seen as "take-away" material with little noise and easy-to-process information. The main questions are how to get the data and how to transform them into a good document format. But data can be sensitive to noise, often called ambiguities. Ambiguities have long been recognized, mainly because polysemy is obvious in language and context is required to remove uncertainty. I claim in this paper that syntactic context is not sufficient to improve interpretation. I try to explain that, firstly, noise can come from natural data themselves, even when high technology is involved, and secondly, texts that are seen as verified but are meaningless can spoil the content of a corpus; this may lead to contradictions and background noise.
