SI · Mar 17, 2023
BotShape: A Novel Social Bots Detection Approach via Behavioral Patterns
Jun Wu, Xuesong Ye, Chengjie Mou
An essential topic in online social network security is how to accurately detect bot accounts and mitigate their harmful effects (e.g., misinformation, rumors, and spam) on genuine users. Based on a real-world data set, we construct behavioral sequences from raw event logs. After extracting critical characteristics from the behavioral time series, we observe differences between bots and genuine users, as well as similar patterns among bot accounts. We present a novel social bot detection system, BotShape, which automatically captures behavioral sequences and characteristics as features for classifiers to detect bots. We evaluate the detection performance of our system on ground-truth instances, showing an average accuracy of 98.52% and an average F1-score of 96.65% across various types of classifiers. After comparing it with other research, we conclude that BotShape is a novel approach to profiling an account, which could improve performance for most methods by providing significant behavioral features.
AI · Apr 24, 2023
FineEHR: Refine Clinical Note Representations to Improve Mortality Prediction
Jun Wu, Xuesong Ye, Chengjie Mou et al.
Monitoring the health status of patients in the Intensive Care Unit (ICU) is a critical aspect of providing superior care and treatment. The availability of large-scale electronic health records (EHR) provides machine learning models with an abundance of clinical text and vital sign data, enabling them to make highly accurate predictions. Despite the emergence of advanced Natural Language Processing (NLP) algorithms for clinical note analysis, the complex textual structure and noise present in raw clinical data pose significant challenges. Coarse embedding approaches without domain-specific refinement limit the accuracy of these algorithms. To address this issue, we propose FineEHR, a system that uses two representation learning techniques, namely metric learning and fine-tuning, to refine clinical note embeddings while leveraging the intrinsic correlations among different health statuses and note categories. We evaluate the performance of FineEHR using two metrics, namely Area Under the Curve (AUC) and AUC-PR, on the real-world MIMIC-III dataset. Our experimental results demonstrate that both refinement approaches improve prediction accuracy, and their combination yields the best results. Moreover, our proposed method outperforms prior works, with an AUC improvement of over 10%, achieving an average AUC of 96.04% and an average AUC-PR of 96.48% across various classifiers.
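For readers unfamiliar with the AUC metric reported above, the following is a minimal sketch of how the area under the ROC curve can be computed from labels and scores. It uses the rank-based definition (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half). The function name `roc_auc` and the toy data are illustrative assumptions and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 ground-truth values (1 = positive class).
    scores: iterable of predicted scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: four patients, two of whom died (label 1).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of samples, which is acceptable for illustration; a sort-based implementation would be preferable on a dataset of realistic size.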
AI · Apr 6, 2023
BotTriNet: A Unified and Efficient Embedding for Social Bots Detection via Metric Learning
Jun Wu, Xuesong Ye, Yanyuet Man
The rapid and accurate identification of bot accounts in online social networks is an ongoing challenge. In this paper, we propose BotTriNet, a unified embedding framework that leverages the textual content posted by accounts to detect bots. Our approach is based on the premise that account personalities and habits can be revealed through their contextual content. To achieve this, we designed a triplet network that refines raw embeddings using metric learning techniques. The BotTriNet framework produces word, sentence, and account embeddings, which we evaluate on a real-world dataset, CRESCI2017, consisting of three bot account categories and five bot sample sets. Our approach achieves state-of-the-art performance on two content-intensive bot sets, with an average accuracy of 98.34% and an F1-score of 97.99%. Moreover, our method makes a significant breakthrough on four content-less bot sets, with an average accuracy improvement of 11.52% and an average F1-score increase of 16.70%. Our contribution is twofold. First, we propose a unified and effective framework that combines various embeddings for bot detection. Second, we demonstrate that metric learning techniques can be applied in this context to refine raw embeddings and improve classification performance. Our approach outperforms prior works and sets a new standard for bot detection in social networks.
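The triplet-network idea mentioned above can be illustrated with the standard triplet margin loss, which pulls an anchor embedding toward a same-class example and pushes it away from a different-class example. This is a minimal sketch of the generic loss only; the Euclidean distance, the margin value, and the function names are assumptions, and the abstract does not state which distance or margin the authors use.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings: anchor and positive come from one class
# (e.g. two bot accounts), negative from another (a genuine account).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
negative = [3.0, 4.0]

satisfied = triplet_loss(anchor, positive, negative)   # negative already far away
violated = triplet_loss(anchor, negative, positive)    # roles swapped: large loss
```

In a trained network the embeddings would be produced by a learned model and the loss minimized by gradient descent; here fixed vectors are used only to show how the loss behaves.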
SI · May 30, 2023
FakeSwarm: Improving Fake News Detection with Swarming Characteristics
Jun Wu, Xuesong Ye
The proliferation of fake news poses a serious threat to society, as it can misinform and manipulate the public, erode trust in institutions, and undermine democratic processes. To address this issue, we present FakeSwarm, a fake news identification system that leverages the swarming characteristics of fake news. To extract the swarm behavior, we propose a novel concept of fake news swarming characteristics and design three types of swarm features: principal component analysis, metric representation, and position encoding. We evaluate our system on a public dataset and demonstrate the effectiveness of incorporating swarm features in fake news identification, achieving an F1-score and accuracy of over 97% by combining all three types of swarm features. Furthermore, we design an online learning pipeline based on a hypothesis about the temporal distribution pattern of fake news emergence. We validate it on a topic with early-emerging fake news and a shortage of text samples, showing that swarm features can significantly improve recall rates in such cases. Our work provides a new perspective and approach to fake news detection and highlights the importance of considering swarming characteristics in detecting fake news.
LG · May 19, 2023
MedLens: Improve Mortality Prediction Via Medical Signs Selecting and Regression
Xuesong Ye, Jun Wu, Chengjie Mou et al.
Monitoring the health status of patients and predicting mortality in advance is vital for providing patients with timely care and treatment. Large volumes of medical signs from electronic health records (EHR) are fed into advanced machine learning models to make predictions. However, the data-quality problem of the original clinical signs is rarely discussed in the literature. Based on an in-depth measurement of the missing rate and correlation score across various medical signs and a large number of patient hospital admission records, we find that the overall missing rate is extremely high and that a large number of useless signs can hurt the performance of prediction models. We conclude that improving data quality alone can raise the baseline accuracy of different prediction algorithms. We designed MedLens, which combines an automatic, statistics-based approach for selecting vital medical signs with a flexible interpolation approach for time series with high missing rates. After improving the data quality of the original medical signs, MedLens applies ensemble classifiers to boost accuracy and reduce computation overhead at the same time. It achieves 0.96 AUC-ROC and 0.81 AUC-PR, exceeding the previous benchmark.
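The two data-quality steps described above, selecting signs by a statistic and filling gaps in sparse time series, can be sketched as follows. This is an illustration under stated assumptions, not the MedLens method: the abstract does not specify the selection statistic or the interpolation scheme, so a simple missing-rate threshold and linear interpolation are swapped in here, and all names (`missing_rate`, `select_signs`, `interpolate`, `max_missing`) are hypothetical.

```python
def missing_rate(series):
    """Fraction of time steps with no recorded value (None)."""
    return sum(v is None for v in series) / len(series)


def select_signs(signs, max_missing=0.5):
    """Keep only signs whose missing rate is at most `max_missing` (assumed rule)."""
    return {name: s for name, s in signs.items() if missing_rate(s) <= max_missing}


def interpolate(series):
    """Fill gaps linearly between observed points; edge gaps copy the nearest value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return list(series)  # nothing observed, nothing to interpolate from

    filled = list(series)
    for i in range(known[0]):                      # leading gap
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):    # trailing gap
        filled[i] = series[known[-1]]
    for left, right in zip(known, known[1:]):      # interior gaps
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical hourly readings for one admission; None marks a missing value.
signs = {
    "heart_rate": [80.0, None, 84.0, 86.0],
    "lactate": [None, None, None, 2.1],
}
kept = select_signs(signs, max_missing=0.5)
cleaned = {name: interpolate(s) for name, s in kept.items()}
```

With these toy inputs, `lactate` is dropped because three of its four readings are missing, and the single gap in `heart_rate` is filled with the midpoint of its neighbors.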