CL, Nov 21, 2019

An Empirical Study of Sections in Classifying Disease Outbreak Reports

arXiv:1911.09319v1, 0.33 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for effective text classification in bio-surveillance systems that monitor news articles for infectious disease outbreaks, but it is incremental as it builds on existing methods by analyzing section-specific contributions.

The study classifies disease outbreak reports among loosely structured news articles by examining the importance of individual sections and the effect of section weighting. It finds that the headline and leading sentence yield high F-scores, that the full text yields the highest recall, and that section weighting can improve accuracy.

Identifying articles that relate to infectious diseases is a necessary step for any automatic bio-surveillance system that monitors news articles from the Internet. Unlike scientific articles, which are available in a strongly structured form, news articles are usually loosely structured. In this chapter, we investigate the importance of each section and the effect of section weighting on the performance of text classification. The experimental results show that (1) classification models using the headline and leading sentence achieve high performance in terms of F-score compared to other parts of the article; (2) using all sections with a bag-of-words representation (full text) achieves the highest recall; and (3) section weighting information can help to improve accuracy.
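As a rough illustration of what section weighting over a bag-of-words representation could look like, the sketch below scales each token's count by a weight tied to the section it appears in. It is not the authors' implementation: the section names (`headline`, `lead`, `body`), the weight values, the tokenizer, and the function names are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical section weights; the abstract does not state actual values.
SECTION_WEIGHTS = {"headline": 3.0, "lead": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_bag_of_words(article, weights=SECTION_WEIGHTS):
    """Build a bag-of-words vector where each token occurrence
    contributes the weight of the section it occurs in."""
    features = Counter()
    for section, text in article.items():
        weight = weights.get(section, 1.0)  # unknown sections count once
        for token in tokenize(text):
            features[token] += weight
    return dict(features)


# Toy article, invented purely for illustration.
article = {
    "headline": "Cholera outbreak reported",
    "lead": "Officials confirmed a cholera outbreak on Monday.",
    "body": "The outbreak follows heavy flooding in the region.",
}
features = weighted_bag_of_words(article)
```

Setting every weight to 1.0 reduces this to the plain full-text bag-of-words case, while passing only the `headline` and `lead` entries corresponds to the headline-plus-leading-sentence setting. The resulting feature dictionary could then be fed to any standard text classifier.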

