Polyvios Pratikakis

SI
6 papers
38 citations
Novelty 28%
AI Score 21

6 Papers

SI Apr 7, 2022
Twitter Dataset on the Russo-Ukrainian War

Alexander Shevtsov, Christos Tzagkarakis, Despoina Antonakaki et al.

On 24 February 2022, Russia invaded Ukraine, starting what is now known as the Russo-Ukrainian War. We have initiated an ongoing acquisition of a dataset through the Twitter API. At the time of writing, the dataset contains 57.3 million tweets originating from 7.7 million users. We apply an initial volume and sentiment analysis; the dataset can also support further exploratory investigation of topic analysis, hate speech, and propaganda recognition, or even reveal potentially malicious entities such as botnets.

SI Jun 6, 2023
Russo-Ukrainian War: Prediction and explanation of Twitter suspension

Alexander Shevtsov, Despoina Antonakaki, Ioannis Lamprou et al.

On 24 February 2022, Russia invaded Ukraine, starting what is now known as the Russo-Ukrainian War and initiating an online discourse on social media. Twitter, as one of the most popular social networks (SNs), with an open and democratic character, enables a transparent discussion among its large user base. Unfortunately, this often leads to violations of Twitter's policies, propaganda, abusive actions, civil integrity violations, and consequently to the suspension and deletion of user accounts. This study focuses on the Twitter suspension mechanism and on the analysis of the shared content and user account features that may lead to suspension. Toward this goal, we have obtained a dataset containing 107.7 million tweets, originating from 9.8 million users, using the Twitter API. We extract the categories of content shared by the suspended accounts and explain their characteristics through the extraction of text embeddings in conjunction with cosine similarity clustering. Our results reveal scam campaigns that take advantage of trending topics regarding the Russo-Ukrainian conflict for Bitcoin and Ethereum fraud, spam, and advertisement campaigns. Additionally, we apply a machine learning methodology that includes a SHapley Additive exPlanations (SHAP) model to understand and explain how user accounts get suspended.
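
The abstract above mentions clustering text embeddings by cosine similarity but does not describe the procedure. The following is a minimal sketch of one simple way to do this, not the authors' method: a greedy pass that assigns each vector to the first cluster whose representative (its first member) is similar enough. The function names, the toy three-dimensional vectors, and the 0.8 threshold are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def cluster_by_cosine(embeddings, threshold=0.8):
    """Greedy clustering (hypothetical helper, not from the paper).

    Each vector joins the first cluster whose representative, the
    cluster's first member, has cosine similarity >= threshold.
    Otherwise it starts a new cluster. Returns lists of indices.
    """
    clusters = []
    for i, vec in enumerate(embeddings):
        for cluster in clusters:
            representative = embeddings[cluster[0]]
            if cosine_similarity(vec, representative) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy "embeddings": two vectors near the x axis, two near the y axis.
toy_embeddings = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.1],
]
toy_clusters = cluster_by_cosine(toy_embeddings, threshold=0.8)
print(toy_clusters)
```

A greedy pass like this depends on input order and on the chosen threshold; real text embeddings would have hundreds of dimensions and would likely call for a library clustering routine instead.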

SI Apr 25, 2019 Code
TwitterMancer: Predicting Interactions on Twitter Accurately

Konstantinos Sotiropoulos, John W. Byers, Polyvios Pratikakis et al.

This paper investigates the interplay between different types of user interactions on Twitter, with respect to predicting missing or unseen interactions. For example, given a set of retweet interactions between Twitter users, how accurately can we predict reply interactions? Is it more difficult to predict retweet or quote interactions between a pair of accounts? Also, how important is time locality, and which features of interaction patterns are most important to enable accurate prediction of specific Twitter interactions? Our empirical study of Twitter interactions contributes initial answers to these questions. We have crawled an extensive dataset of Greek-speaking Twitter accounts and their follow, quote, retweet, and reply interactions over a period of a month. We find we can accurately predict many interactions of Twitter users. Interestingly, the most predictive features vary with the user profiles, and are not the same across all users. For example, for a pair of users that interact with a large number of other Twitter users, we find that certain "higher-dimensional" triads, i.e., triads that involve multiple types of interactions, are very informative, whereas for less active Twitter users, certain in-degrees and out-degrees play a major role. Finally, we provide various other insights on Twitter user behavior. Our code and data are available at https://github.com/twittermancer/. Keywords: Graph mining, machine learning, social media, social networks

SI May 31, 2023
BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline

Alexander Shevtsov, Despoina Antonakaki, Ioannis Lamprou et al.

Twitter, as one of the most popular social networks, provides a platform for communication and online discourse. Unfortunately, it has also become a target for bots and fake accounts, resulting in the spread of false information and manipulation. This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges associated with machine learning model development. Through this pipeline, we develop a comprehensive bot detection model named BotArtist, based on user profile features. SAMLP leverages nine distinct publicly available datasets to train the BotArtist model. To assess BotArtist's performance against current state-of-the-art solutions, we evaluate 35 existing Twitter bot detection methods, each utilizing a diverse range of features. Our comparative evaluation of BotArtist and these existing methods, conducted across nine public datasets under standardized conditions, reveals that the proposed model outperforms existing solutions by almost 10% in terms of F1-score, achieving an average score of 83.19% and 68.5% over specific and general approaches, respectively. As a result of this research, we provide one of the largest labeled Twitter bot datasets. The dataset contains extracted features combined with BotArtist predictions for 10,929,533 Twitter user profiles, collected via Twitter API during the 2022 Russo-Ukrainian War over a 16-month period. This dataset was created based on [Shevtsov et al., 2022a] where the original authors share anonymized tweets discussing the Russo-Ukrainian war, totaling 127,275,386 tweets. The combination of the existing textual dataset and the provided labeled bot and human profiles will enable future development of more advanced bot detection large language models in the post-Twitter API era.

SI Apr 20, 2018
twAwler: A lightweight twitter crawler

Polyvios Pratikakis

This paper presents twAwler, a lightweight Twitter crawler that targets language-specific communities of users. twAwler takes advantage of multiple endpoints of the Twitter API to explore user relations and quickly recognize users belonging to the targeted set. It performs a complete crawl for all users, discovering many standard user relations, including the retweet graph, mention graph, reply graph, quote graph, follow graph, etc. twAwler respects all Twitter policies and rate limits, while being able to monitor large communities of active users. twAwler was used between August 2016 and March 2018 to generate an extensive dataset of close to all Greek-speaking Twitter accounts (about 330 thousand) and their tweets and relations. In total, the crawler has gathered 750 million tweets, of which 424 million are in Greek; 750 million follow relations; information about 300 thousand lists, their members (119 million member relations) and subscribers (27 thousand subscription relations); 705 thousand trending topics; and information on 52 million users in total, of which 292 thousand have since been suspended, 141 thousand have deleted their account, and 3.5 million are protected and cannot be crawled. twAwler mines the collected tweets for the retweet, quote, reply, and mention graphs, which, in addition to the crawled follow relation, offer vast opportunities for analysis and further research.

LG Aug 23, 2017
Massively-Parallel Feature Selection for Big Data

Ioannis Tsamardinos, Giorgos Borboudakis, Pavlos Katsogridakis et al.

We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing the concepts of $p$-values of conditional independence tests and meta-analysis techniques, PFBP manages to rely only on computations local to a partition while minimizing communication costs. It then employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while the algorithm dominates other competitive algorithms in its class.
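
The abstract above says that $p$-values computed locally on each data partition are combined with meta-analysis techniques, without naming a specific rule. As an illustration only, the sketch below uses Fisher's combined probability test, one standard meta-analysis technique, which may or may not match the paper's exact choice. The function name and the example $p$-values are hypothetical. For $k$ independent $p$-values, the statistic $-2 \sum_i \ln p_i$ follows a chi-square distribution with $2k$ degrees of freedom under the null hypothesis, and for even degrees of freedom its tail probability has a closed form, so no external library is needed.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Hypothetical helper for illustration. Returns the combined
    p-value, i.e. the chi-square survival function with 2k degrees
    of freedom evaluated at -2 * sum(log(p_i)).
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    half_statistic = -sum(math.log(p) for p in p_values)  # statistic / 2

    # Closed form for even degrees of freedom 2k:
    # sf(x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_statistic / j
        total += term
    return math.exp(-half_statistic) * total


# Assumed p-values for one feature, each computed on a separate row partition.
local_p_values = [0.04, 0.06, 0.05]
combined_p = fisher_combine(local_p_values)
print(combined_p)
```

Each partition only needs to send a single number per tested feature, which is the sense in which such a scheme keeps computation local and communication small; weak evidence from several partitions can add up to a combined $p$-value smaller than any individual one.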