CLNov 29, 2023Code
TurkishBERTweet: Fast and Reliable Large Language Model for Social Media AnalysisAli Najafi, Onur Varol
Turkish is one of the most popular languages in the world. Wide us of this language on social media platforms such as Twitter, Instagram, or Tiktok and strategic position of the country in the world politics makes it appealing for the social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large scale pre-trained language model for Turkish social media built using almost 900 million tweets. The model shares the same architecture as base BERT model with smaller input length, making TurkishBERTweet lighter than BERTurk and can have significantly lower inference time. We trained our model using the same approach for RoBERTa model and evaluated on two text classification tasks: Sentiment Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet outperforms the other available alternatives on generalizability and its lower inference time gives significant advantage to process large-scale datasets. We also compared our models with the commercial OpenAI solutions in terms of cost and performance to demonstrate TurkishBERTweet is scalable and cost-effective solution. As part of our research, we released TurkishBERTweet and fine-tuned LoRA adapters for the mentioned tasks under the MIT License to facilitate future research and applications on Turkish social media. Our TurkishBERTweet model is available at: https://github.com/ViralLab/TurkishBERTweet
47.8SIApr 30Code
Social Media Data Toolkit: Standardization and Anonymization of Social Network DatasetsAli Najafi, Letizia Iannucci, Mikko Kivelä et al.
The rapid diversification of social media platforms and the increasing restrictions on official APIs have significantly complicated cross-platform analysis. Researchers are often forced to rely on heterogeneous datasets obtained through web scraping and historical archives; however they often lack structural consistency. Prior to conducting cross-platform social media analyses, one needs to answer three critical questions: (1) What makes platforms different and similar? (2) How were the datasets collected? (3) How can we align the datasets of different platforms to conduct fair analyses? To address these questions, we introduce the Social Media Data Toolkit (\projectname{}), a comprehensive Python framework designed for the standardization, anonymization, and enrichment of social network datasets. \projectname{} unifies diverse data structures into a generic schema comprising Communities, Accounts, Posts, Actions, and Entities to facilitate multi-platform research. The framework features a configurable anonymization module to secure Personally Identifiable Information (PII) and an extendable enrichment layer that integrates Large Language Models (LLMs) and network analysis tools for downstream tasks such as stance detection and toxicity scoring without creating codebase for different datasets. We demonstrate the versatility of \projectname{} through four case studies spanning from textual analysis of the content to network analysis across platforms. To offer reproducible social media research, \projectname{} is released as an open-source tool featuring detailed documentation and practical guides for researchers at any skill-level. It can be accessed at github.com/ViralLab/SMDT and varollab.com/SMDT.
CVMar 31, 2021
Drowsiness Detection Based On Driver Temporal Behavior Using a New Developed DatasetFarnoosh Faraji, Faraz Lotfi, Javad Khorramdel et al.
Driver drowsiness detection has been the subject of many researches in the past few decades and various methods have been developed to detect it. In this study, as an image-based approach with adequate accuracy, along with the expedite process, we applied YOLOv3 (You Look Only Once-version3) CNN (Convolutional Neural Network) for extracting facial features automatically. Then, LSTM (Long-Short Term Memory) neural network is employed to learn driver temporal behaviors including yawning and blinking time period as well as sequence classification. To train YOLOv3, we utilized our collected dataset alongside the transfer learning method. Moreover, the dataset for the LSTM training process is produced by the mentioned CNN and is formatted as a two-dimensional sequence comprised of eye blinking and yawning time durations. The developed dataset considers both disturbances such as illumination and drivers' head posture. To have real-time experiments a multi-thread framework is developed to run both CNN and LSTM in parallel. Finally, results indicate the hybrid of CNN and LSTM ability in drowsiness detection and the effectiveness of the proposed method.
IROct 11, 2020
ComStreamClust: a communicative multi-agent approach to text clustering in streaming dataAli Najafi, Araz Gholipour-Shilabin, Rahim Dehkharghani et al.
Topic detection is the task of determining and tracking hot topics in social media. Twitter is arguably the most popular platform for people to share their ideas with others about different issues. One such prevalent issue is the COVID-19 pandemic. Detecting and tracking topics on these kinds of issues would help governments and healthcare companies deal with this phenomenon. In this paper, we propose a novel, multi-agent, communicative clustering approach, so-called ComStreamClust for clustering sub-topics inside a broader topic, e.g., COVID-19. The proposed approach is parallelizable, and can simultaneously handle several data-point. The LaBSE sentence embedding is used to measure the semantic similarity between two tweets. ComStreamClust has been evaluated on two datasets: the COVID-19 and the FA CUP. The results obtained from ComStreamClust approve the effectiveness of the proposed approach when compared to existing methods.