Fahad AlQurashi

AI
3 papers
28 citations
Novelty: 15%
AI Score: 15

3 Papers

AI, Oct 19, 2022
Machine and Deep Learning Methods with Manual and Automatic Labelling for News Classification in Bangla Language

Istiak Ahmad, Fahad AlQurashi, Rashid Mehmood

Research in Natural Language Processing (NLP) has become increasingly important due to applications such as text classification, text mining, sentiment analysis, POS tagging, named entity recognition, and textual entailment. This paper introduces several machine learning (ML) and deep learning (DL) methods with manual and automatic labelling for news classification in the Bangla language. The ML algorithms are Logistic Regression (LR), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbour (KNN), used with Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Doc2Vec embedding models. The DL algorithms are Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Convolutional Neural Network (CNN), used with Word2Vec, GloVe, and FastText word embedding models. We develop automatic labelling methods using Latent Dirichlet Allocation (LDA) and investigate the performance of single-label and multi-label article classification methods. To investigate performance, we developed from scratch Potrika, the largest and most extensive dataset for news classification in the Bangla language, comprising 185.51 million words and 12.57 million sentences in 664,880 news articles across eight distinct categories, curated from six popular online news portals in Bangladesh for the period 2014-2020. On manually labelled data, GRU with FastText achieves the highest accuracy, 91.83%. With automatic labelling, KNN with Doc2Vec achieves the highest accuracy: 57.72% on single-label data and 75% on multi-label data. The methods developed in this paper are expected to advance research in Bangla and other languages.
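As a rough illustration of one ML pairing named in the abstract (TF-IDF features with a K-Nearest Neighbour classifier), the sketch below implements both in plain Python. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, cosine similarity as the distance measure, the value of `k`, and the toy English sentences are all assumptions made for illustration. Only the category names Sports and Economy come from the dataset description.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real Bangla text
    # would need a language-appropriate tokenizer.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf_vector(doc, idf):
    # Sparse, L2-normalised TF-IDF vector; unseen terms are dropped.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(query, train_vecs, train_labels, idf, k=3):
    # Majority vote among the k most similar training documents.
    q = tfidf_vector(query, idf)
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(q, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy English stand-ins for news articles (hypothetical data).
train_docs = [
    "the team won the cricket match",
    "the striker scored a late goal in the match",
    "the central bank raised interest rates",
    "inflation and bank lending rates rose",
]
train_labels = ["Sports", "Sports", "Economy", "Economy"]

idf = fit_idf(train_docs)
train_vecs = [tfidf_vector(d, idf) for d in train_docs]

economy_guess = knn_predict("the bank cut interest rates",
                            train_vecs, train_labels, idf)
sports_guess = knn_predict("a late goal decided the cricket match",
                           train_vecs, train_labels, idf)
print(economy_guess, sports_guess)
```

Swapping the feature step (for example, raw term counts for Bag of Words) or the voting step would give other ML pairings listed in the abstract, though the paper's actual settings are not stated here.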

CL, Oct 17, 2022
Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes

Istiak Ahmad, Fahad AlQurashi, Rashid Mehmood

Knowledge is central to human and scientific development. Natural Language Processing (NLP) allows automated analysis and creation of knowledge. Data is a crucial ingredient of NLP and machine learning. The scarcity of open datasets is a well-known problem in machine and deep learning research. This is very much the case for textual NLP datasets in English and other major world languages. For the Bangla language, the situation is even more challenging and the number of large datasets for NLP research is practically nil. We hereby present Potrika, a large single-label Bangla news article textual dataset curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology), and each record provides five attributes (News Article, Category, Headline, Publication Date, and Newspaper Source). The raw dataset contains 185.51 million words and 12.57 million sentences in 664,880 news articles. Moreover, using NLP augmentation techniques, we create from the raw (unbalanced) dataset another (balanced) dataset comprising 320,000 news articles, with 40,000 articles in each of the eight news categories. Potrika contains both datasets (raw and balanced) to suit a wide range of NLP research. To the best of our knowledge, Potrika is by far the largest and most extensive dataset for news classification.
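The balancing step described above (a fixed number of articles per category, 8 × 40,000 = 320,000 in the paper) can be sketched as follows. The abstract does not say which augmentation techniques were used, so `swap_augment` is a hypothetical stand-in that swaps two words. The function names, the random downsampling of large categories, and the toy data are likewise assumptions.

```python
import random
from collections import Counter, defaultdict


def swap_augment(text, rng):
    # Hypothetical augmentation: swap two randomly chosen word positions.
    # The paper's actual augmentation techniques are not specified here.
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def balance(articles, per_class, seed=0):
    # articles: list of (text, category) pairs.
    # Returns exactly per_class pairs for every category present.
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for text, category in articles:
        by_category[category].append(text)

    balanced = []
    for category, texts in by_category.items():
        if len(texts) >= per_class:
            # Over-represented category: random downsample.
            chosen = rng.sample(texts, per_class)
        else:
            # Under-represented category: keep originals, then add
            # augmented copies until the target size is reached.
            chosen = list(texts)
            while len(chosen) < per_class:
                chosen.append(swap_augment(rng.choice(texts), rng))
        balanced.extend((text, category) for text in chosen)
    return balanced


# Toy unbalanced data (hypothetical): 5 Sports articles, 2 Economy articles.
raw = [(f"sports report number {i}", "Sports") for i in range(5)]
raw += [(f"economy report number {i}", "Economy") for i in range(2)]

balanced = balance(raw, per_class=3)
counts = Counter(category for _, category in balanced)
print(counts)
```

With `per_class=40000` and eight categories this scheme would yield the 320,000-article size reported for the balanced dataset, assuming every category is handled the same way.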

AI, Feb 20, 2023
Multi-generational labour markets: data-driven discovery of multi-perspective system parameters using machine learning

Abeer Abdullah Alaql, Fahad Alqurashi, Rashid Mehmood

Economic issues, such as inflation, energy costs, taxes, and interest rates, are a constant presence in our daily lives and have been exacerbated by global events such as pandemics, environmental disasters, and wars. A sustained history of financial crises reveals significant weaknesses and vulnerabilities in the foundations of modern economies. Another significant current issue is people quitting their jobs in large numbers. Moreover, many organizations have a diverse workforce comprising multiple generations, which poses new challenges. Transformative approaches in economics and labour markets are needed to protect our societies, economies, and planet. In this work, we use big data and machine learning methods to discover multi-perspective parameters for multi-generational labour markets. The parameters for the academic perspective are discovered using 35,000 article abstracts from the Web of Science for the period 1958-2022, and those for the professionals' perspective using 57,000 LinkedIn posts from 2022. We discover a total of 28 parameters and categorise them into 5 macro-parameters: Learning & Skills, Employment Sectors, Consumer Industries, Learning & Employment Issues, and Generations-specific Issues. A complete machine learning software tool is developed for data-driven parameter discovery. A variety of quantitative and visualisation methods are applied, and multiple taxonomies are extracted to explore multi-generational labour markets. A knowledge structure and literature review of multi-generational labour markets using over 100 research articles is provided. It is expected that this work will enhance the theory and practice of AI-based methods for knowledge discovery and system parameter discovery to develop autonomous capabilities and systems and promote novel approaches to labour economics and markets, leading to the development of sustainable societies and economies.