Anindya Iqbal

h-index: 3

20 papers

1,124 citations

Novelty: 33%

AI Score: 47

Ranked #59,765 of 201,326 authors (top 30%); #652 in SE (top 19%)

20 Papers

69.8 | SE | Apr 15 | Code

ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering

MD Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal et al.

Toxic interactions during code reviews can undermine teamwork and hinder productivity in software engineering (SE) teams. While prior studies explore toxicity detection and empirical investigation, they lack real-time detoxification tools to support the SE community. To address this gap, we present ToxiShield, a browser extension for GitHub pull requests that is built from three modules: i) Toxicity Filter, which identifies whether a text is toxic; ii) Communication Coach, which provides just-in-time, fine-grained toxicity categorization with explanations; and iii) Reframer, which generates a revised, constructive alternative to a toxic text. For each module, we trained and evaluated multiple deep learning models and Large Language Models (LLMs) to identify the best choice. A BERT-based binary detection model, trained on 38,761 code review samples, achieves 98% accuracy and a 97% F1-score and was selected for the Toxicity Filter module. For the Communication Coach, prompt-tuned Claude 3.5 Sonnet achieved the best performance, with 39% MCC and 42% F1 in multiclass toxicity classification with detailed reasoning. For the Reframer, we evaluated five LLMs using a fine-tuning strategy on a dataset of 10,120 code review comments. The fine-tuned Llama 3.2 model achieves 95.27% style transfer accuracy, 97.03% fluency, 67.07% content preservation, and an 84% J-score. We further validated ToxiShield through a human evaluation using the Technology Acceptance Model with 10 participants, confirming its perceived usefulness and ease of adoption. ToxiShield sets a benchmark for advancing constructive communication in software engineering, driving inclusivity and healthier collaboration in open-source communities.
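
The three-module design described above can be pictured as a short pipeline in which the later stages run only on text the filter flags. The sketch below is one possible arrangement, not the authors' implementation: `run_pipeline`, `ReviewResult`, and the keyword-based `toy_*` functions are hypothetical stand-ins for the BERT, Claude 3.5 Sonnet, and Llama 3.2 models named in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ReviewResult:
    original: str
    is_toxic: bool
    category: Optional[str] = None      # set only when the text is flagged
    explanation: Optional[str] = None   # set only when the text is flagged
    suggestion: Optional[str] = None    # constructive rewrite, if flagged


def run_pipeline(
    text: str,
    toxicity_filter: Callable[[str], bool],
    coach: Callable[[str], Tuple[str, str]],
    reframer: Callable[[str], str],
) -> ReviewResult:
    """Chain the three stages; the coach and reframer see only flagged text."""
    if not toxicity_filter(text):
        return ReviewResult(original=text, is_toxic=False)
    category, explanation = coach(text)
    return ReviewResult(text, True, category, explanation, reframer(text))


# Hypothetical keyword-based stand-ins, NOT the models from the paper.
def toy_filter(text: str) -> bool:
    return any(word in text.lower() for word in ("stupid", "idiot"))


def toy_coach(text: str) -> Tuple[str, str]:
    return "insult", "The comment attacks the author rather than the code."


def toy_reframer(text: str) -> str:
    return "Could you explain the reasoning behind this change?"


if __name__ == "__main__":
    print(run_pipeline("This is a stupid change", toy_filter, toy_coach, toy_reframer))
```

Passing the stages in as callables keeps the control flow independent of whichever model backs each module.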

SE | Jul 31, 2023 | Code

Contrastive Learning for API Aspect Analysis

G. M. Shahariar, Tahmid Hasan, Anindya Iqbal et al.

We present CLAA, a novel approach for API aspect detection in API reviews that utilizes transformer models trained with a supervised contrastive loss objective function. We evaluate CLAA using performance and impact analysis. For performance analysis, we utilized a benchmark dataset of developer discussions collected from Stack Overflow and compared the results to those obtained using state-of-the-art transformer models. Our experiments show that contrastive learning can significantly improve the performance of transformer models in detecting aspects such as Performance, Security, Usability, and Documentation. For impact analysis, we performed an empirical study and a developer study. On 200 randomly selected and manually labeled online reviews, CLAA achieved 92% accuracy while the SOTA baseline achieved 81.5%. According to our developer study involving 10 participants, the use of 'Stack Overflow + CLAA' resulted in increased accuracy and confidence during API selection. Replication package: https://github.com/disa-lab/Contrastive-Learning-API-Aspect-ASE2023
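
To make the supervised contrastive objective concrete, here is a minimal pure-Python sketch of the commonly used form of that loss: each anchor is pulled toward the other samples in the batch that share its label and pushed away from the rest. The function name `supcon_loss`, the temperature of 0.1, and the toy 2-D embeddings are assumptions for illustration; the exact formulation and hyperparameters used by CLAA may differ.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[str],
    temperature: float = 0.1,
) -> float:
    """Supervised contrastive loss averaged over anchors that have positives."""
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Temperature-scaled similarity between anchor i and every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Log of the softmax denominator, computed stably.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supcon_loss(toy, ["Performance", "Performance", "Security", "Security"]))
```

The loss is small when same-aspect embeddings already sit close together and large when the labels cut across the clusters, which is the signal that drives training.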

45.4 | SE | Apr 10 | Code

Real-Time Toxicity Filtering for Open-Source Code Reviews

Md Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal et al.

Toxic interactions in open-source software development harm community collaboration. To combat this, we propose ToxiShield, a real-time browser extension that identifies and detoxifies toxic code reviews. The framework comprises three modules: toxicity identification, reasoned multiclass classification, and code review detoxification. Our fine-tuned BERT-based binary classifier achieved a 97% F1-score on 38,761 code review texts. For multiclass classification, Claude 3.5 Sonnet with prompt engineering achieved a 39% MCC and 42% F1 on 1,200 samples. Finally, our fine-tuned Llama 3.2 detoxification model reached 95.27% style transfer accuracy, 97.03% fluency, 67.07% content preservation, and an 84% J-score. Validation with 10 software developers suggests ToxiShield effectively fosters a more inclusive open-source environment.
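
For readers less familiar with the metrics quoted above, the sketch below computes accuracy, F1, and the Matthews correlation coefficient (MCC) from a binary confusion matrix. It shows only the binary case for simplicity; the abstract reports MCC for a multiclass task, which uses a generalized formula. The function name `binary_metrics` and the counts in the usage line are made up for illustration and are not the paper's data.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, F1, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # MCC is treated as 0 when any row or column of the matrix is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Illustrative counts only.
    print(binary_metrics(tp=90, fp=10, fn=10, tn=90))
```

Unlike accuracy, MCC uses all four cells of the matrix, so it stays informative when one class is much rarer than the other.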

LG | Apr 16, 2023

Enhancing Automated Program Repair through Fine-tuning and Prompt Engineering

Rishov Paul, Md. Mohib Hossain, Mohammed Latif Siddiq et al.

Sequence-to-sequence models have been used to transform erroneous programs into correct ones when trained with a large enough dataset. Some recent studies also demonstrated strong empirical evidence that code review could improve program repair further. Large language models, trained with Natural Language (NL) and Programming Language (PL), can contain inherent knowledge of both. In this study, we investigate whether this inherent knowledge of PL and NL can be utilized to improve automated program repair. We applied PLBART and CodeT5, two state-of-the-art language models that are pre-trained with both PL and NL, to two such natural language-based program repair datasets and found that the pre-trained language models fine-tuned with datasets containing both code review and subsequent code changes notably outperformed each of the previous models. With the advent of code generative models like Codex and GPT-3.5-Turbo, we also performed zero-shot and few-shot learning-based prompt engineering to assess their performance on these datasets. However, based on our manual analysis of the repaired code generated by the models, the practical application of LLMs to automated program repair is still a long way off.

CL | Mar 22, 2022

Are You Misinformed? A Study of Covid-Related Fake News in Bengali on Facebook

Protik Bose Pranto, Syed Zami-Ul-Haque Navid, Protik Dey et al.

Our opinions and views of life can be shaped by how we perceive the opinions of others on social media like Facebook. This dependence has increased during the COVID-19 period, when we have fewer means to connect with others. However, fake news related to COVID-19 has become a significant problem on Facebook. Bengali is the seventh most spoken language worldwide, yet we are aware of no previous research that studied the prevalence of COVID-19 related fake news in Bengali on Facebook. In this paper, we develop machine learning models to detect fake news in Bengali automatically. The best performing model is BERT, with an F1-score of 0.97. We apply BERT to all Facebook Bengali posts related to COVID-19. We find 10 topics in the COVID-19 Bengali fake news, grouped into three categories: system (e.g., medical system), belief (e.g., religious rituals), and social (e.g., scientific awareness).

SE | Nov 15, 2024 | Code

Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation

Md. Asif Haider, Ayesha Binte Mostofa, Sk. Sabit Bin Mosaddek et al.

Generating accurate code review comments remains a significant challenge due to the inherently diverse and non-unique nature of the task output. Large language models pretrained on both programming and natural language data tend to perform well in code-oriented tasks. However, large-scale pretraining is not always feasible due to its environmental impact and project-specific generalizability issues. In this work, we first fine-tune open-source large language models (LLMs) in a parameter-efficient, quantized low-rank (QLoRA) fashion on consumer-grade hardware to improve review comment generation. Recent studies demonstrate the efficacy of augmenting prompts with semantic metadata to boost performance in other code-related tasks. To explore this in code review activities, we also prompt proprietary, closed-source LLMs, augmenting the input code patch with function call graphs and code summaries. Both of our strategies improve review comment generation performance, with function call graph augmented few-shot prompting on the GPT-3.5 model surpassing the pretrained baseline by around 90% in BLEU-4 score on the CodeReviewer dataset. Moreover, few-shot prompted Gemini-1.0 Pro and QLoRA fine-tuned Code Llama and Llama 3.1 models achieve competitive results (ranging from 25% to 83% performance improvement) on this task. An additional human evaluation study further validates our experimental findings, reflecting real-world developers' perceptions of LLM-generated code review comments based on relevant qualitative metrics.
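
As an illustration of what augmenting a few-shot prompt with a function call graph and a code summary might look like, the sketch below assembles such a prompt as plain text. The section headings, the instruction wording, and the `build_review_prompt` helper are all hypothetical; the abstract does not give the prompt template the authors used, and no model is called here.

```python
from typing import List, Sequence, Tuple


def build_review_prompt(
    patch: str,
    call_graph: Sequence[Tuple[str, str]],
    summary: str,
    examples: Sequence[Tuple[str, str]],
) -> str:
    """Assemble a few-shot prompt: examples, then summary, call graph, and patch."""
    parts: List[str] = [
        "You are reviewing a code change. Write one concise review comment."
    ]
    # Few-shot demonstrations: (patch, reference review comment) pairs.
    for ex_patch, ex_comment in examples:
        parts.append(
            f"### Example patch\n{ex_patch}\n### Example review\n{ex_comment}"
        )
    parts.append(f"### Code summary\n{summary}")
    # The call graph is serialized as one "caller -> callee" edge per line.
    edges = "\n".join(f"{caller} -> {callee}" for caller, callee in call_graph)
    parts.append(f"### Function call graph\n{edges}")
    parts.append(f"### Patch\n{patch}\n### Review")
    return "\n\n".join(parts)


if __name__ == "__main__":
    prompt = build_review_prompt(
        patch="-    return parse(data)\n+    return validate(parse(data))",
        call_graph=[("load", "parse"), ("parse", "validate")],
        summary="Adds validation to the loader.",
        examples=[("+ x = x + 1", "Consider using `x += 1`.")],
    )
    print(prompt)
```

Ending the prompt with an open "Review" section is one common way to cue the model to produce only the comment.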

CR | Nov 9, 2023

LogShield: A Transformer-based APT Detection System Leveraging Self-Attention

Sihat Afnan, Mushtari Sadia, Shahrear Iqbal et al.

Cyber attacks are often identified using system and network logs. There have been significant prior works that utilize provenance graphs and ML techniques to detect attacks, specifically advanced persistent threats (APTs), which are very difficult to detect. Lately, there have been studies where transformer-based language models are used to detect various types of attacks from system logs. However, no such attempts have been made in the case of APTs. In addition, existing state-of-the-art techniques that use system provenance graphs lack a data processing framework generalized across datasets for optimal performance. To mitigate this limitation and to explore the effectiveness of transformer-based language models, this paper proposes LogShield, a framework designed to detect APT attack patterns by leveraging the power of self-attention in transformers. We incorporate customized embedding layers to effectively capture the context of event sequences derived from provenance graphs. While acknowledging the computational overhead associated with training transformer networks, our framework surpasses existing LSTM and language models in APT detection. We integrated the model parameters and training procedure from the RoBERTa model and conducted extensive experiments on well-known APT datasets (DARPA OpTC and DARPA TC E3). Our framework achieved superior F1 scores of 98% and 95% on the two datasets respectively, surpassing the F1 scores of 96% and 94% obtained by LSTM models. Our findings suggest that LogShield's performance benefits from larger datasets and demonstrate its potential for generalization across diverse domains. These findings contribute to the advancement of APT attack detection methods and underscore the significance of transformer-based architectures in addressing security challenges in computer systems.

CL | May 29, 2021 | Code

CoDesc: A Large Code-Description Parallel Dataset

Masum Hasan, Tanveer Muttaqueen, Abdullah Al Ishtiaq et al.

Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistencies across published works. In this study, we present CoDesc, a large parallel dataset composed of 4.2 million Java methods and natural language descriptions. With extensive analysis, we identify and remove prevailing noise patterns from the dataset. We demonstrate the proficiency of CoDesc in two complementary tasks for code-description pairs: code summarization and code search. We show that the dataset helps improve code search by up to 22% and achieves the new state-of-the-art in code summarization. Furthermore, we show CoDesc's effectiveness in a pre-training and fine-tuning setup, opening possibilities for building pretrained language models for Java. To facilitate future research, we release the dataset, a data processing tool, and a benchmark at https://github.com/csebuetnlp/CoDesc.

SE | Jan 26, 2021 | Code

Using a Balanced Scorecard to Identify Opportunities to Improve Code Review Effectiveness: An Industrial Experience Report

Masum Hasan, Anindya Iqbal, Mohammad Rafid Ul Islam et al.

Peer code review is a widely adopted software engineering practice to ensure code quality and software reliability in both commercial and open-source software projects. Due to the large effort overhead associated with practicing code reviews, project managers often wonder whether their code reviews are effective and whether there are improvement opportunities in that respect. Since project managers at Samsung Research Bangladesh (SRBD) were also intrigued by these questions, this research developed, deployed, and evaluated a production-ready solution using the Balanced Scorecard (BSC) strategy that SRBD managers can use in their day-to-day management to monitor the code review effectiveness of an individual developer, a particular project, or the entire organization. Following the four-step framework of the BSC strategy, we: 1) defined the operation goals of this research, 2) defined a set of metrics to measure the effectiveness of code reviews, 3) developed an automated mechanism to measure those metrics, and 4) developed and evaluated a monitoring application to inform the key stakeholders. Our automated model to identify useful code reviews achieves 7.88% and 14.39% improvement in terms of accuracy and minority class F1 score respectively over the models proposed in prior studies. It also outperforms the human evaluators from SRBD that the model replaces by margins of 25.32% and 23.84% in terms of accuracy and minority class F1 score respectively. In our post-deployment survey, SRBD developers and managers indicated that they found our solution useful and that it provided them with important insights to help their decision making.

CL | Jan 1, 2021 | Code

BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad et al.

In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed 'Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results, outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.

SE | Dec 7, 2019 | Code

Early Prediction for Merged vs Abandoned Code Changes in Modern Code Reviews

Md. Khairul Islam, Toufique Ahmed, Rifat Shahriyar et al.

The modern code review process is an integral part of current software development practice. Considerable effort is spent inspecting code changes, finding defects, suggesting improvements, and addressing the suggestions of the reviewers. In a code review process, several iterations usually take place in which an author submits code changes and a reviewer gives feedback until the reviewer is happy to accept the change. In around 12% of cases, the changes are eventually abandoned, wasting all of that effort. In this research, our objective is to design a tool that can predict at an early stage whether a code change will be merged or abandoned, to reduce the wasted effort of all stakeholders involved (e.g., program author, reviewer, project management). The real-world demand for such a tool was formally identified in a study by Fan et al. [1]. We have mined 146,612 code changes from the code reviews of three large and popular open-source software projects and trained and tested a suite of supervised machine learning classifiers, both shallow and deep learning based. We consider a total of 25 features of each code change during the training and testing of the models. The best performing model, named PredCR (Predicting Code Review), is a LightGBM-based classifier that achieves around 85% AUC score on average and relatively improves on the state-of-the-art [1] by 14-23%. In our empirical study on the 146,612 code changes from the three software projects, we find that (1) the new features introduced in PredCR, such as reviewer dimensions, are the most informative; (2) compared to the baseline, PredCR is more effective at reducing bias against new developers; and (3) PredCR uses historical data in the code review repository, so its performance improves as a software system evolves with new and more data.
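
The AUC score quoted above has a simple ranking interpretation: it is the probability that a randomly chosen merged change receives a higher predicted score than a randomly chosen abandoned one, with ties counted as half. The sketch below computes it directly from that definition; the function name `roc_auc`, the label convention (1 = merged, 0 = abandoned), and the toy scores are assumptions for illustration, not PredCR's code or data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical convention: 1 = merged, 0 = abandoned.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

This pairwise form is quadratic in the number of samples, which is fine for a worked example; library implementations typically sort the scores once instead.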

SE | Nov 10, 2018 | Code

Understanding the Motivations, Challenges and Needs of Blockchain Software Developers: A Survey

Amiangshu Bosu, Anindya Iqbal, Rifat Shahriyar et al.

Blockchain technology has potential applications in various areas such as smart contracts, Internet of Things (IoT), land registry, supply chain management, storing medical data, and identity management. Although GitHub currently hosts more than six thousand active Blockchain software (BCS) projects, little software engineering research has investigated these projects and their contributors. Although the number of BCS projects is growing rapidly, the motivations, challenges, and needs of BCS developers remain a puzzle. Therefore, the primary objective of this study is to understand the motivations, challenges, and needs of BCS developers and analyze the differences between BCS and non-BCS development. To this end, we sent an online survey to 1,604 active BCS developers identified by mining the GitHub repositories of 145 popular BCS projects. The survey received 156 responses that met our criteria for analysis. The results suggest that the majority of BCS developers are experienced in non-BCS development and are primarily motivated by the ideology of creating a decentralized financial system. Although most BCS projects are Open Source Software (OSS) projects by nature, more than 93% of our respondents found BCS development somewhat different from non-BCS development, as BCS projects place a higher emphasis on security and reliability than most non-BCS projects. Other differences include higher costs of defects, a decentralized and hostile environment, technological complexity, and difficulty in upgrading the software after release. Software development tools that are tuned for non-BCS development are inadequate for BCS, and the ecosystem needs an array of new or improved tools, such as customized IDEs for BCS development tasks, debuggers for smart contracts, testing support, easily deployable simulators, and BCS domain-specific design notations.

SE | Jul 20, 2021

A Survey-Based Qualitative Study to Characterize Expectations of Software Developers from Five Stakeholders

Khalid Hasan, Partho Chakraborty, Rifat Shahriyar et al.

Background: Studies on developer productivity and well-being find that the perceptions of productivity in a software team can be a socio-technical problem. Intuitively, problems and challenges can be better handled by managing expectations in software teams. Aim: Our goal is to understand whether the expectations of software developers vary towards diverse stakeholders in software teams. Method: We surveyed 181 professional software developers to understand their expectations from five different stakeholders: (1) organizations, (2) managers, (3) peers, (4) new hires, and (5) government and educational institutions. The five stakeholders were determined by conducting semi-formal interviews of software developers. We asked open-ended survey questions and analyzed the responses using open coding. Results: We observed 18 multi-faceted expectation types. While some expectations are more specific to a stakeholder, other expectations are cross-cutting. For example, developers expect work benefits from their organizations, but expect the adoption of standard software engineering (SE) practices from their organizations, peers, and new hires. Conclusion: Out of the 18 categories, three categories are related to career growth. This observation supports previous research that happiness cannot be assured by simply offering more money or a promotion. Among the expectations with the most responses, we find those from educational institutions to offer relevant teaching and from governments to improve job stability, which indicates the increasingly important roles of these organizations in helping software developers. This observation can be especially true during the COVID-19 pandemic.

SE | May 4, 2021

How do developers discuss and support new programming languages in technical Q&A site? An empirical study of Go, Swift, and Rust in Stack Overflow

Partha Chakraborty, Rifat Shahriyar, Anindya Iqbal et al.

New programming languages (e.g., Swift, Go, Rust) are being introduced to give developers a better opportunity to make software development robust and easy. At an early stage, a programming language is likely to have resource constraints that encourage developers to seek help frequently from experienced peers active in Q&A sites such as Stack Overflow (SO). In this study, we have formally studied the discussions on three popular new languages introduced after the inception of SO (2008) and matched those with the relevant activities in GitHub whenever appropriate. For that purpose, we have mined 41,782,536 questions and answers from SO, along with information on 7,846 issues and 660,965 repositories from GitHub. Initially, the development of new languages is relatively slow compared to mature languages (e.g., C, C++, Java). The expected outcome of this study is to reveal the difficulties and challenges faced by the developers working with these languages so that appropriate measures can be taken to expedite the generation of relevant resources. We have used the LDA method on SO's questions and answers to identify different topics of new languages. We have extracted several features of the answer pattern of the new languages from SO to study their characteristics. These attributes were used to identify difficult topics. We explored the background of developers who are contributing to these languages. We have created a model by combining Stack Overflow data with issue, repository, and user data from GitHub. Finally, we have used that model to identify factors that affect language evolution. We believe that the outcome of our study is likely to help the owners and sponsors of these languages design better features and documentation. It will also help software developers and students prepare themselves to work on these languages in an informed way.

SE | Apr 16, 2021

BERT2Code: Can Pretrained Language Models be Leveraged for Code Search?

Abdullah Al Ishtiaq, Masum Hasan, Md. Mahim Anjum Haque et al.

Millions of repetitive code snippets are submitted to code repositories every day. Searching these large codebases using simple natural language queries would allow programmers to ideate, prototype, and develop more easily and quickly. Although the existing methods have shown good performance in searching code when the natural language description contains keywords from the code, they are still far behind in searching code based on the semantic meaning of the natural language query and the semantic structure of the code. In recent years, both the natural language and programming language research communities have created techniques to embed natural language and code in vector spaces. In this work, we leverage the efficacy of these embedding models using a simple, lightweight 2-layer neural network in the task of semantic code search. We show that our model learns the inherent relationship between the embedding spaces and further probe into the scope of improvement by empirically analyzing the embedding methods. In this analysis, we show that the quality of the code embedding model is the bottleneck for our model's performance, and discuss future directions of study in this area.
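
The idea of bridging two embedding spaces with a small network can be sketched as follows: a query embedding is passed through two linear layers, and the result is compared with precomputed code embeddings by cosine similarity. Everything concrete here is an assumption for illustration: the ReLU activation, the hand-set identity weights (a real model would learn them), the 2-D vectors, and the names `map_query` and `search` do not come from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]
Params = Tuple[Matrix, Vector, Matrix, Vector]  # (w1, b1, w2, b2)


def linear(x: Vector, w: Matrix, b: Vector) -> List[float]:
    """Affine layer: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v: Vector) -> List[float]:
    return [max(0.0, x) for x in v]


def map_query(q: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> List[float]:
    """Two-layer network projecting a query embedding into the code space."""
    return linear(relu(linear(q, w1, b1)), w2, b2)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec: Vector, code_vecs: Dict[str, Vector], params: Params, top_k: int = 1) -> List[str]:
    """Rank code snippets by cosine similarity to the projected query."""
    projected = map_query(query_vec, *params)
    ranked = sorted(code_vecs, key=lambda name: cosine(projected, code_vecs[name]), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    params = (identity, [0.0, 0.0], identity, [0.0, 0.0])  # untrained, hand-set weights
    snippets = {"read_file": [0.9, 0.1], "sort_list": [0.1, 0.9]}
    print(search([1.0, 0.2], snippets, params))
```

Because only the small mapping network would be trained, the quality of the frozen code embeddings bounds what the search can achieve, which matches the bottleneck the abstract describes.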

SE | Mar 21, 2021

An Empirical Study of Developer Discussions on Low-Code Software Development Challenges

Md Abdullah Al Alamin, Sanjay Malakar, Gias Uddin et al.

Low-code software development (LCSD) is an emerging paradigm that combines minimal source code with interactive graphical interfaces to promote rapid application development. LCSD aims to democratize application development for software practitioners with diverse backgrounds. Given that LCSD is a relatively new paradigm, it is vital to learn about the challenges developers face during their adoption of LCSD platforms. The online developer forum Stack Overflow (SO) is popular among software developers for asking for solutions to their technical problems. We observe a growing body of posts in SO with discussions of LCSD platforms. In this paper, we present an empirical study of around 5K SO posts (questions + accepted answers) that contain discussions of nine popular LCSD platforms. We apply topic modeling to the posts to determine the types of topics discussed. We find 13 topics related to LCSD in SO. The 13 topics are grouped into four categories: Customization, Platform Adoption, Database Management, and Third-Party Integration. More than 40% of the questions are about customization, i.e., developers frequently face challenges with customizing user interfaces or services offered by LCSD platforms. The topic "Dynamic Event Handling" under the "Customization" category is the most popular (in terms of average view counts per question of the topic) as well as the most difficult. This means that developers frequently search for customization solutions such as how to attach dynamic events to a form in a low-code UI, yet most (75.9%) of their questions remain without an accepted answer. We manually label 900 questions from the posts to determine the prevalence of the topics' challenges across LCSD phases. We find that most of the questions are related to the development phase, and low-code developers also face challenges with automated testing.

SE | Feb 16, 2021

Automatic Detection of Five API Documentation Smells: Practitioners' Perspectives

Junaed Younus Khan, Md. Tawkat Islam Khondaker, Gias Uddin et al.

The learning and usage of an API is supported by official documentation. Like source code, API documentation is itself a software product. Several research results show that bad design in API documentation can make the reuse of API features difficult. Indeed, similar to code smells or code antipatterns, poorly designed API documentation can also exhibit 'smells'. Such documentation smells can be described as bad documentation styles that do not necessarily produce incorrect documentation but nevertheless make the documentation difficult to understand and use properly. Recent research on API documentation has focused on finding content inaccuracies in API documentation and complementing API documentation with external resources (e.g., crowd-shared code examples). We are aware of no research that focused on the automatic detection of API documentation smells. This paper makes two contributions. First, we produce a catalog of five API documentation smells by consulting literature on API documentation presentation problems. We create a benchmark dataset of 1,000 API documentation units by exhaustively and manually validating the presence of the five smells in Java official API reference and instruction documentation. Second, we conduct a survey of 21 professional software developers to validate the catalog. The developers agreed that they frequently encounter all five smells in official API documentation, and 95.2% of them reported that the presence of the documentation smells negatively affects their productivity. The participants wished for tool support to automatically detect and fix the smells in official API documentation. We develop a suite of rule-based, deep, and shallow machine learning classifiers to automatically detect the smells. The best performing classifier, BERT, a deep learning model, achieves F1-scores of 0.75 to 0.97.

CV | Oct 4, 2020

Static and Animated 3D Scene Generation from Free-form Text Descriptions

Faria Huq, Nafees Ahmed, Anindya Iqbal

Generating coherent and useful image/video scenes from a free-form textual description is technically a very difficult problem to handle. Textual descriptions of the same scene can vary greatly from person to person, or sometimes even for the same person from time to time. As the choice of words and syntax varies while preparing a textual description, it is challenging for the system to reliably produce a consistently desirable output from different forms of language input. Prior works on scene generation have been mostly confined to rigorous sentence structures of text input, which restrict the freedom of users to write descriptions. In our work, we study a new pipeline that aims to generate static as well as animated 3D scenes from different types of free-form textual scene descriptions without any major restriction. In particular, to keep our study practical and tractable, we focus on a small subspace of all possible 3D scenes, containing various combinations of cube, cylinder, and sphere. We design a two-stage pipeline. In the first stage, we encode the free-form text using an encoder-decoder neural architecture. In the second stage, we generate a 3D scene based on the generated encoding. Our neural architecture exploits a state-of-the-art language model as the encoder to leverage rich contextual encoding and a new multi-head decoder to predict multiple features of an object in the scene simultaneously. For our experiments, we generate a large synthetic dataset which contains 1,300,000 and 1,400,000 samples of unique static and animated scene descriptions, respectively. We achieve 98.427% accuracy on the test dataset in detecting the 3D object features successfully. Our work shows a proof of concept of one approach towards solving the problem, and we believe that with enough training data, the same pipeline can be expanded to handle an even broader set of 3D scene generation problems.

SE | Oct 4, 2020

Review4Repair: Code Review Aided Automatic Program Repairing

Faria Huq, Masum Hasan, Mahim Anzum Haque Pantho et al.

Context: Learning-based automatic program repair techniques are showing promise in providing quality fix suggestions for detected bugs in the source code of software. These tools mostly exploit historical data of buggy and fixed code changes and are heavily dependent on bug localizers when applied to a new piece of code. With the increasing popularity of code review, dependency on bug localizers can be reduced. Besides, code review-based bug localization is more trustworthy since reviewers' expertise and experience are reflected in these suggestions. Objective: The natural language instructions written in review comments are rich sources of information about a bug's nature and expected solutions. However, to the best of our knowledge, none of the learning-based tools has utilized review comments to fix programming bugs. In this study, we investigate the performance improvement of repair techniques using code review comments. Method: We train a sequence-to-sequence model on 55,060 code reviews and associated code changes. We also introduce new tokenization and preprocessing approaches that help to achieve significant improvement over state-of-the-art learning-based repair techniques. Results: We boost the top-1 accuracy by 20.33% and top-10 accuracy by 34.82%. We could provide suggestions for stylistic and non-code errors unaddressed by prior techniques. Conclusion: We believe that the automatic fix suggestions, along with the code review, generated by our approach would help developers address review comments quickly and correctly and thus save their time and effort.
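
The top-1 and top-10 figures above follow the usual top-k accuracy convention for repair models: a bug counts as fixed if the reference fix appears among the model's first k ranked candidates. The sketch below shows that computation under the assumption of exact string matching; the helper name `top_k_accuracy` and the toy candidates are illustrative and are not taken from the paper, whose matching criterion may differ.

```python
from typing import Sequence


def top_k_accuracy(
    candidates: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose reference fix is among the first k candidates."""
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    hits = sum(1 for cands, ref in zip(candidates, references) if ref in cands[:k])
    return hits / len(references)


if __name__ == "__main__":
    # Each inner list holds one model's ranked fix suggestions for one bug.
    ranked = [["a", "b"], ["x", "y"], ["m", "n"]]
    truth = ["a", "y", "z"]
    print(top_k_accuracy(ranked, truth, k=1), top_k_accuracy(ranked, truth, k=2))
```

Top-k accuracy can only stay the same or rise as k grows, which is why top-10 figures are typically higher than top-1.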

CL | May 12, 2019

A Benchmark Study of Machine Learning Models for Online Fake News Detection

Junaed Younus Khan, Md. Tawkat Islam Khondaker, Sadia Afroz et al.

The proliferation of fake news and its propagation on social media has become a major concern due to its ability to create devastating impacts. Different machine learning approaches have been suggested to detect fake news. However, most of those focused on a specific type of news (such as political news), which leads us to the question of dataset bias in the models used. In this research, we conducted a benchmark study to assess the performance of different applicable machine learning approaches on three different datasets, of which we accumulated the largest and most diversified one. We explored a number of advanced pre-trained language models for fake news detection along with traditional and deep learning ones and, for the first time to the best of our knowledge, compared their performance from different aspects. We find that BERT and similar pre-trained models perform the best for fake news detection, especially with very small datasets. Hence, these models are a significantly better option for languages with limited electronic content, i.e., training data. We also carried out several analyses based on the models' performance, article topic, and article length, and discussed the lessons learned from them. We believe that this benchmark study will help the research community to explore further and help news sites and blogs to select the most appropriate fake news detection method.