CL, IR, LG | Mar 16, 2018

Corpus Statistics in Text Classification of Online Data

arXiv:1803.06390v1 | 3 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for reproducible and transportable machine learning in online data analysis. Its contribution is incremental: it applies existing methods to new data in a specific domain.

The study examines how corpus characteristics of textual data sets from online health forums affect text classification results. It finds that specific corpus features significantly influence classification accuracy, with improvements of up to 15% observed when these characteristics are taken into account.

The transformation of Machine Learning (ML) from a boutique science into a generally accepted technology has increased the importance of reproduction and transportability of ML studies. In the current work, we investigate how corpus characteristics of textual data sets correspond to text classification results. We work with two data sets gathered from sub-forums of an online health-related forum. Our empirical results are obtained for a multi-class sentiment analysis application.
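The abstract does not list which corpus characteristics were measured. As a minimal sketch of what computing such characteristics might look like, the Python below derives a few common descriptive statistics (document count, token count, vocabulary size, type-token ratio, average document length, class distribution) from a labeled corpus. The function name `corpus_statistics`, the whitespace tokenizer, the choice of statistics, and the toy posts are all assumptions made for illustration, not the authors' method or data.

```python
from collections import Counter


def corpus_statistics(documents, labels):
    """Compute simple descriptive statistics for a labeled text corpus.

    The statistics chosen here are illustrative assumptions, not the
    corpus features used in the paper.
    """
    if len(documents) != len(labels):
        raise ValueError("documents and labels must have the same length")

    # Naive lowercase whitespace tokenization (an assumption for this sketch).
    tokenized = [doc.lower().split() for doc in documents]
    token_counts = Counter(tok for doc in tokenized for tok in doc)

    n_docs = len(tokenized)
    n_tokens = sum(token_counts.values())  # total running words
    n_types = len(token_counts)            # distinct words (vocabulary size)

    return {
        "n_docs": n_docs,
        "n_tokens": n_tokens,
        "n_types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "avg_doc_length": n_tokens / n_docs if n_docs else 0.0,
        "class_distribution": dict(Counter(labels)),
    }


# Hypothetical toy posts and sentiment labels, made up for illustration.
posts = [
    "thank you so much for the support",
    "the treatment did not work for me",
    "the clinic opens at nine",
]
sentiments = ["positive", "negative", "neutral"]

stats = corpus_statistics(posts, sentiments)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Computing the same statistics for each of the two sub-forum data sets would give a side-by-side profile that could then be compared against per-data-set classification results.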

