Extracting Information-rich Part of Texts using Text Denoising
This paper addresses the challenge of efficiently extracting key information from biomedical texts, though the contribution appears incremental, as it builds on existing readability concepts for domain-specific applications.
The paper tackles the problem of processing large volumes of text data by introducing Text Denoising, a technique that highlights information-rich content based on a text readability index, showing that the reduced text set is more information-rich in tasks like biomedical relation extraction and keyphrase indexing.
The aim of this paper is to report on a novel text reduction technique, called Text Denoising, that highlights information-rich content when processing a large volume of text data, especially from the biomedical domain. The core feature of the technique, the text readability index, embodies the hypothesis that complex text is more information-rich than the rest. When applied to tasks such as biomedical relation-bearing text extraction, keyphrase indexing, and extraction of sentences describing protein interactions, the reduced set of text produced by Text Denoising is shown to be more information-rich than the rest.
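The idea of filtering by a readability index can be illustrated with a minimal sketch. The abstract does not say which readability index or threshold is used, so everything below is an assumption made for illustration: the Automated Readability Index (ARI) stands in for the paper's index, and the names `ari`, `denoise`, and `keep_fraction` are hypothetical. The sketch scores each sentence and keeps the least readable (most complex) fraction, in original order.

```python
import re


def ari(sentence: str) -> float:
    """Automated Readability Index of a single sentence.

    ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43.
    ARI is an assumed stand-in; the source does not name its index.
    """
    words = re.findall(r"[A-Za-z0-9]+", sentence)
    if not words:
        return float("-inf")  # empty sentences rank last
    chars = sum(len(w) for w in words)
    # One sentence at a time, so words-per-sentence is just the word count.
    return 4.71 * (chars / len(words)) + 0.5 * len(words) - 21.43


def denoise(text: str, keep_fraction: float = 0.5) -> list[str]:
    """Keep the most complex sentences (highest ARI), preserving order.

    keep_fraction is a hypothetical parameter, not taken from the paper.
    """
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    n_keep = max(1, round(len(sentences) * keep_fraction))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: ari(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_keep])]


if __name__ == "__main__":
    sample = (
        "We ran it. "
        "Phosphorylation of the retinoblastoma protein regulates "
        "transcriptional activation. "
        "It was fine. "
        "The kinase interacts with cyclin-dependent inhibitors."
    )
    for kept in denoise(sample, keep_fraction=0.5):
        print(kept)
```

On this toy input the two short, plain sentences score lowest and are dropped, which mirrors the stated hypothesis that complex text carries more information. A real pipeline would likely need a proper sentence splitter for biomedical text and an index tuned to the domain.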