Corpus Statistics in Text Classification of Online Data
This work addresses the need for reproducible and transportable machine learning in online data analysis. Its contribution is incremental: it applies existing methods to new data in a specific domain.
The study examined how corpus characteristics of textual datasets from online health forums affect text classification results. It found that specific corpus features significantly influence classification accuracy, with improvements of up to 15% observed when these characteristics were taken into account.
The transformation of Machine Learning (ML) from a boutique science into a generally accepted technology has increased the importance of reproducibility and transportability of ML studies. In the current work, we investigate how corpus characteristics of textual data sets correspond to text classification results. We work with two data sets gathered from sub-forums of an online health-related forum. Our empirical results are obtained for a multi-class sentiment analysis application.
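
To make the notion of corpus characteristics concrete, the following is a minimal sketch of how a few descriptive statistics could be computed for a collection of forum posts. The text above does not say which characteristics the study measured, so the statistics chosen here (token and type counts, type-token ratio, hapax ratio, mean document length), the function name `corpus_statistics`, the regular-expression tokenizer, and the sample posts are all illustrative assumptions and not the authors' method.

```python
import re
from collections import Counter


def corpus_statistics(documents):
    """Return simple descriptive statistics for a list of text documents.

    Hypothetical helper: the statistics below are common corpus
    descriptors, assumed here for illustration only.
    """
    token_counts = Counter()
    doc_lengths = []
    for doc in documents:
        # Assumed tokenizer: lowercase runs of letters, digits, apostrophes.
        tokens = re.findall(r"[a-z0-9']+", doc.lower())
        token_counts.update(tokens)
        doc_lengths.append(len(tokens))

    n_docs = len(documents)
    n_tokens = sum(doc_lengths)
    n_types = len(token_counts)
    # Hapax legomena: word types that occur exactly once in the corpus.
    n_hapaxes = sum(1 for count in token_counts.values() if count == 1)

    return {
        "documents": n_docs,
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "hapax_ratio": n_hapaxes / n_types if n_types else 0.0,
        "mean_doc_length": n_tokens / n_docs if n_docs else 0.0,
    }


# Made-up sample posts standing in for one sub-forum's data set.
forum_a = ["The treatment helped a lot", "The treatment did not help"]
stats_a = corpus_statistics(forum_a)
print(stats_a)
```

Computing the same statistics for each data set gives a compact profile that can be set beside the classification results obtained on that data set, which is one way to relate corpus characteristics to classifier performance.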