LG

Oct 1, 2025

Comparison of Machine Learning Models to Classify Documents on Digital Development

Uvini Ranaweera, Bawun Mawitagama, Sanduni Liyanage, Sandupa Keshan, Tiloka de Silva, Supun Hewawalpita

arXiv:2510.00720v2

h-index: 3

Originality: Synthesis-oriented

AI Analysis

This research addresses document classification for digital-development organizations. It is incremental: it applies standard methods to a new dataset without introducing novel techniques.

The study compares multiple machine learning models for classifying documents on digital development interventions and investigates a One vs Rest approach to build a combined model. It finds that model performance depends not only on the amount of data but also on similarity within classes and dissimilarity among classes. Accuracy, precision, recall and F1-score are used for evaluation, but no concrete numbers are provided.

Automated document classification is a trending topic in Natural Language Processing (NLP) due to the extensive growth in digital databases. However, a model that fits well for a specific classification task might perform weakly for another dataset due to differences in the context. Thus, training and evaluating several models is necessary to optimise the results. This study employs a publicly available document database on worldwide digital development interventions categorised under twelve areas. Since digital interventions are still emerging, utilising NLP in the field is relatively new. Given the exponential growth of digital interventions, this research has a vast scope for improving how digital-development-oriented organisations report their work. The paper examines the classification performance of Machine Learning (ML) algorithms, including Decision Trees, k-Nearest Neighbors, Support Vector Machine, AdaBoost, Stochastic Gradient Descent, Naive Bayes, and Logistic Regression. Accuracy, precision, recall and F1-score are utilised to evaluate the performance of these models, while oversampling is used to address the class-imbalanced nature of the dataset. Deviating from the traditional approach of fitting a single model for multiclass classification, this paper investigates the One vs Rest approach to build a combined model that optimises the performance. The study concludes that the amount of data is not the sole factor affecting the performance; features like similarity within classes and dissimilarity among classes are also crucial.
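To make the pipeline described in the abstract more concrete, the following is a minimal sketch of its three ingredients: oversampling to balance classes, a One vs Rest combination of binary models, and a macro-averaged F1-score. Everything specific here is an assumption for illustration and not taken from the paper: the toy documents and the three class names are invented, random duplication stands in for whichever oversampling method the authors used, and the smoothed log-ratio word scorer is a hypothetical stand-in for the classifiers the paper compares.

```python
import math
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class documents at random until all classes are the same size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    target = max(len(docs) for docs in by_class.values())
    out_texts, out_labels = [], []
    for label, docs in by_class.items():
        extra = [rng.choice(docs) for _ in range(target - len(docs))]
        out_texts.extend(docs + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


def train_one_vs_rest(texts, labels):
    """Fit one binary scorer per class (that class vs. all other classes)."""
    models = {}
    for cls in set(labels):
        pos, neg = Counter(), Counter()
        for text, label in zip(texts, labels):
            (pos if label == cls else neg).update(text.lower().split())
        vocab = set(pos) | set(neg)
        pos_total = sum(pos.values()) + len(vocab)
        neg_total = sum(neg.values()) + len(vocab)
        # Hypothetical stand-in scorer: add-one smoothed log-ratio of word frequencies.
        models[cls] = {
            w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
            for w in vocab
        }
    return models


def predict(models, text):
    """Score the text with every binary model and return the highest-scoring class."""
    tokens = text.lower().split()
    scores = {cls: sum(w.get(t, 0.0) for t in tokens) for cls, w in models.items()}
    return max(scores, key=scores.get)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores."""
    f1_scores = []
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented toy corpus with an imbalanced class distribution (not the paper's data).
texts = [
    "mobile health clinic app",
    "health records for clinic",
    "telemedicine health service",
    "digital payments mobile money",
    "online school learning platform",
    "school teachers learning tools",
]
labels = ["health", "health", "health", "finance", "education", "education"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
models = train_one_vs_rest(balanced_texts, balanced_labels)
predictions = [predict(models, t) for t in texts]
print(macro_f1(labels, predictions))
```

In a real experiment, oversampling would normally be applied to the training split only, so that duplicated documents do not leak into the evaluation data. Libraries such as scikit-learn (`OneVsRestClassifier`, `f1_score` with `average="macro"`) and imbalanced-learn (`RandomOverSampler`) provide tested versions of these pieces; whether the authors used them is not stated in the abstract.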
