CL · Feb 28, 2023

Text classification dataset and analysis for Uzbek language

arXiv:2302.14494v1 · 21 citations · h-index: 9
Originality: Synthesis-oriented
AI Analysis

This work addresses the lack of resources for Uzbek NLP by providing a dataset and baseline, but it is incremental as it applies existing methods to a new language.

The study tackled text classification for the Uzbek language by creating a new dataset from 10 news websites across 15 categories and evaluating models, with the BERTbek transformer model achieving the best performance.

Text classification is an important task in Natural Language Processing (NLP), where the goal is to categorise text data into predefined classes. In this study, we analyse the dataset creation steps and evaluation techniques of a multi-label news categorisation task as part of text classification. We first present a newly obtained dataset for Uzbek text classification, which was collected from 10 different news and press websites and covers 15 categories of news, press and law texts. We also present a comprehensive evaluation of different models, ranging from traditional bag-of-words models to deep learning architectures, on this newly created dataset. Our experiments show that the Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) based models outperform the rule-based models. The best performance is achieved by the BERTbek model, which is a transformer-based BERT model trained on an Uzbek corpus. Our findings provide a good baseline for further research in Uzbek text classification.
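
To make the "traditional bag-of-words" end of that model range concrete, here is a minimal sketch of a bag-of-words classifier, written as a multinomial Naive Bayes with add-one smoothing in plain Python. It is not the paper's implementation: the class name `BagOfWordsNB`, the whitespace tokenizer, the two category labels and the four toy sentences are all assumptions made up for illustration, and the example assigns a single label per text.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


class BagOfWordsNB:
    """Hypothetical helper: multinomial Naive Bayes over word counts."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data invented for this sketch; it is not drawn from the paper's dataset.
train_texts = [
    "futbol jamoasi g'alaba qozondi",
    "terma jamoa o'yinda gol urdi",
    "bank kredit foizini oshirdi",
    "valyuta kursi va bank foizi",
]
train_labels = ["sport", "sport", "economy", "economy"]

model = BagOfWordsNB().fit(train_texts, train_labels)
prediction = model.predict("jamoa gol urdi")
```

The model ignores word order entirely and scores a text only by which training words it contains, which is the defining property of a bag-of-words representation and the reason such models serve as a low-cost baseline against neural architectures.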

Code Implementations: 1 repo