CL · Oct 10, 2020
Tag Recommendation for Online Q&A Communities based on BERT Pre-Training Technique
Navid Khezrian, Jafar Habibi, Issa Annamoradnejad
Online Q&A and open-source communities use tags and keywords to index, categorize, and search for specific content. The most obvious advantage of tag recommendation is the correct classification of information. In this study, we used the BERT pre-training technique for the tag recommendation task in online Q&A and open-source communities for the first time. Our evaluation on freecode datasets shows that the proposed method, called TagBERT, is more accurate than deep learning and other baseline methods. Moreover, our model achieved high stability by solving a problem of previous research, where increasing the number of tag recommendations significantly reduced model performance.
LG · Aug 10, 2020
Using Experts' Opinions in Machine Learning Tasks
Jafar Habibi, Amir Fazelinia, Issa Annamoradnejad
In machine learning tasks, especially prediction tasks, scientists tend to rely solely on available historical data and disregard unproven insights, such as experts' opinions, polls, and betting odds. In this paper, we propose a general three-step framework for utilizing experts' insights in machine learning tasks and build four concrete models for a sports game prediction case study. For the case study, we have chosen the task of predicting NCAA Men's Basketball games, which has been the focus of a group of Kaggle competitions in recent years. Results strongly suggest that the good performance and high scores of past models are a result of chance rather than of a well-performing and stable model. Furthermore, our proposed models achieve steadier results with a lower average log loss (best at 0.489) than the top solutions of the 2019 competition (>0.503), and reach the top 1%, 10% and 1% in the 2017, 2018 and 2019 leaderboards, respectively.
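For readers unfamiliar with the metric quoted above, the following is a minimal sketch of binary log loss. The `blend` helper is purely hypothetical: it shows one simple way an expert-derived probability (for example, one implied by betting odds) could be mixed with a model's probability, and it is not the paper's three-step framework.

```python
import math


def log_loss(outcomes, probabilities, eps=1e-15):
    """Average binary log loss; lower is better."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        # Clip to avoid log(0) on overconfident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(outcomes)


def blend(p_model, p_expert, weight):
    """Hypothetical helper: weighted mix of model and expert probabilities."""
    return weight * p_expert + (1.0 - weight) * p_model


# Predicting 0.5 for every game scores ln(2), about 0.693,
# which puts the 0.489 and 0.503 figures in context.
coin_flip_loss = log_loss([1, 0], [0.5, 0.5])
mixed_probability = blend(p_model=0.6, p_expert=0.8, weight=0.5)
```

Because log loss punishes confident wrong predictions heavily, a single upset can move a leaderboard score noticeably, which is consistent with the abstract's concern about results driven by chance.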
CL · Apr 27, 2020
ColBERT: Using BERT Sentence Embedding in Parallel Neural Networks for Computational Humor
Issa Annamoradnejad, Gohar Zoghi
Automation of humor detection and rating has interesting use cases in modern technologies, such as humanoid robots, chatbots, and virtual assistants. In this paper, we propose a novel approach for detecting and rating humor in short texts based on a popular linguistic theory of humor. The proposed method begins by separating the sentences of the given text and using the BERT model to generate an embedding for each one. The embeddings are fed to separate lines of hidden layers in a neural network (one line for each sentence) to extract latent features. Finally, the parallel lines are concatenated to determine the congruity and other relationships between the sentences and predict the target value. We accompany the paper with a novel dataset for humor detection consisting of 200,000 formal short texts. In addition to evaluating our work on the novel dataset, we participated in a live machine learning competition focused on rating humor in Spanish tweets. The proposed model obtained F1 scores of 0.982 and 0.869 in the humor detection experiments, which outperform general and state-of-the-art models. The evaluation performed on two contrasting settings confirms the strength and robustness of the model and suggests two important factors in achieving high accuracy in the current task: 1) usage of sentence embeddings and 2) utilizing the linguistic structure of humor in designing the proposed model.
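The parallel-line idea described in this abstract can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the authors' implementation: the sentence embeddings are random stand-ins rather than BERT outputs, the weights are untrained, and the sizes `EMBED_DIM`, `HIDDEN_DIM`, and `NUM_LINES` are hypothetical and chosen small for readability.

```python
import math
import random

random.seed(0)

EMBED_DIM = 8    # hypothetical; real BERT-base embeddings have 768 dimensions
HIDDEN_DIM = 4   # hypothetical width of each parallel line
NUM_LINES = 3    # hypothetical number of sentences handled in parallel


def random_matrix(rows, cols):
    """Untrained weights, used only to make the sketch runnable."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense_relu(vector, weights):
    """One hidden layer: a weighted sum per unit followed by ReLU."""
    return [max(0.0, sum(v * w for v, w in zip(vector, unit))) for unit in weights]


# One separate set of weights per sentence line.
line_weights = [random_matrix(HIDDEN_DIM, EMBED_DIM) for _ in range(NUM_LINES)]
output_weights = [random.gauss(0.0, 0.1) for _ in range(NUM_LINES * HIDDEN_DIM)]


def predict(sentence_embeddings):
    """Run each sentence through its own line, concatenate, then score."""
    merged = []
    for embedding, weights in zip(sentence_embeddings, line_weights):
        merged.extend(dense_relu(embedding, weights))
    logit = sum(f * w for f, w in zip(merged, output_weights))
    return merged, 1.0 / (1.0 + math.exp(-logit))


# Stand-in embeddings; in practice each would come from a BERT encoder.
fake_embeddings = [
    [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(NUM_LINES)
]
merged_features, humor_score = predict(fake_embeddings)
```

Keeping the lines separate until the concatenation step means the final layer sees per-sentence features side by side, which is where relationships between sentences, such as a setup and a punchline, can be compared.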
CL · Feb 24, 2020
Predicting Subjective Features of Questions of QA Websites using BERT
Issa Annamoradnejad, Mohammadamin Fazli, Jafar Habibi
Community Question-Answering websites, such as StackOverflow and Quora, expect users to follow specific guidelines in order to maintain content quality. These systems mainly rely on community reports for assessing content, an approach with serious problems such as the slow handling of violations, the loss of normal and experienced users' time, the low quality of some reports, and discouraging feedback to new users. Therefore, with the overall goal of providing solutions for automating moderation actions in Q&A websites, we aim to provide a model to predict 20 quality or subjective aspects of questions in QA websites. To this end, we used data gathered by the CrowdSource team at Google Research in 2019 and fine-tuned a pre-trained BERT model on our problem. Evaluated by Mean Squared Error (MSE), the model achieved a value of 0.046 after 2 epochs of training, which did not improve substantially in subsequent epochs. Results confirm that simple fine-tuning can yield accurate models in little time and with less data.
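As a reminder of how a multi-target MSE such as the one reported above is computed, here is a small sketch. The numbers are made up for illustration, and only three aspects per question are shown instead of the 20 used in the paper.

```python
def mean_squared_error(targets, predictions):
    """Average squared difference over all questions and all aspects."""
    squared_errors = [
        (t - p) ** 2
        for target_row, predicted_row in zip(targets, predictions)
        for t, p in zip(target_row, predicted_row)
    ]
    return sum(squared_errors) / len(squared_errors)


# Two questions, three hypothetical aspect scores each, all in [0, 1].
targets = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
predictions = [[0.8, 0.1, 0.5], [0.2, 0.7, 0.5]]
mse = mean_squared_error(targets, predictions)
```

With scores bounded in [0, 1], an MSE near 0.046 corresponds to a typical per-aspect error of roughly 0.2, since the square root of 0.046 is about 0.21.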
SI · Jul 21, 2019
A Comprehensive Analysis of Twitter Trending Topics
Issa Annamoradnejad, Jafar Habibi
On Twitter, a name, phrase, or topic that is mentioned at a greater rate than others is called a "trending topic" or simply a "trend". Twitter's trends list has a powerful ability to promote public events such as natural events, political scandals, market changes and other types of breaking news. Nevertheless, very few works have focused on the dynamics of these trending topics. In this article, we thoroughly examined Twitter's trending topics of 2018. To this end, we automatically accessed Twitter's trends API and stored the resulting top 50 trending topics in a novel dataset. We analyze our dataset according to six proposed criteria: lexical analysis, time to reach, trend reoccurrence, trending time, tweets count, and language analysis. Based on our results, 77.6% of the topics that reached the Top-10 list were trending with fewer than 100k tweets. More than 50% of the topics could not hold their position for more than an hour. English and Arabic comprised close to 40% and 20% of the first-ranked topics, respectively.