SI Apr 7, 2022
Twitter Dataset on the Russo-Ukrainian War
Alexander Shevtsov, Christos Tzagkarakis, Despoina Antonakaki et al.
On 24 February 2022, Russia invaded Ukraine, starting what is now known as the Russo-Ukrainian War. We have initiated an ongoing dataset acquisition through the Twitter API. At the time of writing, the dataset contains 57.3 million tweets originating from 7.7 million users. We apply an initial volume and sentiment analysis; the dataset can also support further exploratory work on topic analysis, hate speech, and propaganda recognition, or help reveal potentially malicious entities such as botnets.
SI Jun 6, 2023
Russo-Ukrainian War: Prediction and explanation of Twitter suspension
Alexander Shevtsov, Despoina Antonakaki, Ioannis Lamprou et al.
On 24 February 2022, Russia invaded Ukraine, starting what is now known as the Russo-Ukrainian War and initiating an online discourse on social media. Twitter, as one of the most popular social networks, with an open and democratic character, enables a transparent discussion among its large user base. Unfortunately, this often leads to violations of Twitter's policies, propaganda, abusive actions, and civil integrity violations, and consequently to the suspension and deletion of user accounts. This study focuses on the Twitter suspension mechanism and on the analysis of the shared content and user account features that may lead to suspension. Toward this goal, we obtained a dataset containing 107.7M tweets, originating from 9.8 million users, using the Twitter API. We extract the categories of content shared by the suspended accounts and explain their characteristics through the extraction of text embeddings in conjunction with cosine similarity clustering. Our results reveal scam campaigns that exploit trending topics about the Russo-Ukrainian conflict for Bitcoin and Ethereum fraud, spam, and advertisement campaigns. Additionally, we apply a machine learning methodology that includes a SHapley Additive exPlanations (SHAP) explainability model to understand and explain how user accounts get suspended.
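As a rough illustration of grouping text embeddings by cosine similarity, the sketch below uses a simple greedy threshold rule. The abstract does not say which clustering algorithm or embedding model was used, so the function names, the greedy strategy, and the `threshold` value are all assumptions made for illustration only.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)

def cluster_by_cosine(embeddings, threshold=0.8):
    """Greedy clustering (an assumed strategy, not the paper's):
    each vector joins the first cluster whose seed vector is similar enough,
    otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices; the first index is the seed
    for i, vec in enumerate(embeddings):
        for members in clusters:
            if cosine(vec, embeddings[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy 2-D "embeddings"; real text embeddings would have hundreds of dimensions.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
groups = cluster_by_cosine(toy, threshold=0.8)
```

With the toy vectors, the first two and the last two end up in separate groups. A greedy rule like this depends on input order, which is one reason a study at this scale would more plausibly rely on an established clustering method.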
56.9 SI May 20
Detecting Synthetic Political Narratives in Cross-Platform Social Media Discourse
Despoina Antonakaki, Sotiris Ioannidis
The proliferation of large language models has introduced a new paradigm of synthetic political communication in which narratives may be generated, semantically coordinated, and strategically disseminated across platforms at scale. We present a cross-platform framework for detecting synthetic political narratives using four coordination signals -- lexical diversity D(C), temporal burstiness B(C), rhetorical repetition R(C), and semantic homogenization H(C) -- combined into a Synthetic Narrative Coordination Score SNC(C). We apply the framework to a corpus of 353,223 records spanning six geopolitical event windows collected from six Telegram channels and nine Reddit communities (2023--2026). Results show that IntelSlava exhibits the lowest lexical diversity (MATTR 0.52--0.54), the highest burstiness (B=+0.48 to +0.73), and the highest rhetorical overlap with peer channels (Jaccard 0.12), ranking first in the composite SNC(C) on four of six event windows (SNC 0.45--0.60). Rybar ranks last on all windows despite its high semantic homogenization, because its Russian-language output yields high lexical diversity and near-zero rhetorical Jaccard with English-language channels -- demonstrating that no single indicator is sufficient for coordination detection. Multi-dimensional SNC(C) scoring provides a more robust and interpretable signal than any individual metric.
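Three of the signals named above have common textbook forms, sketched here under stated assumptions: MATTR as a moving-average type-token ratio, burstiness as the Goh-Barabasi coefficient (sigma - mu) / (sigma + mu) over inter-event gaps, and rhetorical overlap as Jaccard similarity between phrase sets. The abstract does not give the exact definitions, window size, or the way the signals are combined into SNC(C), so the function names and defaults are hypothetical and the composite score is deliberately omitted.

```python
from statistics import mean, pstdev

def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean share of unique tokens per sliding window."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return mean(ratios)

def burstiness(timestamps):
    """Goh-Barabasi burstiness of inter-event gaps, in [-1, 1].
    -1 means perfectly regular posting; values near +1 mean highly bursty posting."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    mu, sigma = mean(gaps), pstdev(gaps)
    return 0.0 if mu + sigma == 0 else (sigma - mu) / (sigma + mu)

def jaccard(a, b):
    """Jaccard similarity of two collections, e.g. recurring phrases of two channels."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

regular = burstiness([0, 10, 20, 30])          # evenly spaced posts
overlap = jaccard({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```

Evenly spaced timestamps give a burstiness of -1, and the two toy phrase sets share two of four distinct phrases, giving a Jaccard value of 0.5.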
62.6 CY Apr 16
From Parliamentary Rhetoric to Enacted Law: An NLP Pipeline for Semantic Auditing of the Greek Legislative Process
Despoina Antonakaki, Sotiris Ioannidis
The Greek legislative framework is characterized by intricate cross-referencing, frequent amendments, and limited machine-readable access, hindering transparency and civic engagement. Traditional bulk-archiving approaches are computationally expensive and fail to capture political relevance. We present a multimodal computational pipeline that bridges parliamentary discourse with enacted legislation. Applying Natural Language Processing (NLP) to 2025 Hellenic Parliament transcripts, we extracted 534 unique law citations and used debate frequency as an empirical signal to identify politically salient laws. A headless browser architecture enables automated acquisition of official Government Gazette documents despite anti-scraping barriers. Using Large Language Models (LLMs), we conduct a semantic audit of legislative quality. Our analysis reveals an "Illusion of Simplicity", where laws framed as simplifications exhibit high structural complexity and ambiguity. A typology of 312 ambiguity instances shows that 45 percent stem from vague terminology and 25 percent from deferred executive delegation. We introduce the Political Discrepancy Index (PDI), evaluating alignment between ministerial promises and enacted law. Across three high-frequency laws (4808/2021, 4412/2016, 4662/2020), the dominant outcome is Deferral, with commitments shifted to future Ministerial Decisions. Cross-reference network analysis confirms a highly entangled legal system, with foundational provisions among the most frequently amended. The pipeline produces a semantically linked dataset and an interactive auditing interface for scalable analysis of legislative processes.
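The idea of using debate frequency as a salience signal can be sketched as a citation counter over transcript text. The regular expression below assumes that laws are cited in the number/year form seen in the abstract (for example 4808/2021). The real transcripts are in Greek and the paper's extraction rules are not described here, so the pattern, the function name, and the sample sentences are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed citation shape: up to four digits, a slash, and a four-digit year.
# This will also match unrelated number/year strings, so a real pipeline
# would need contextual cues such as a preceding word for "law".
LAW_CITATION = re.compile(r"\b(\d{1,4})/((?:19|20)\d{2})\b")

def count_law_citations(transcripts):
    """Count how often each law identifier appears across transcript texts."""
    counts = Counter()
    for text in transcripts:
        for number, year in LAW_CITATION.findall(text):
            counts[f"{number}/{year}"] += 1
    return counts

# Hypothetical English stand-ins for transcript sentences.
sample = [
    "Law 4808/2021 amends Law 4412/2016.",
    "Under Law 4808/2021 the ministry will issue a decision.",
]
counts = count_law_citations(sample)
most_debated = counts.most_common(1)[0][0]
```

Ranking laws by `counts.most_common()` then gives a simple ordering of which laws are mentioned most often in debate.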
SI Mar 3
Cross-Platform Digital Discourse Analysis of Iran: Topics, Sentiment, Polarization, and Event Validation on Telegram and Reddit
Despoina Antonakaki, Sotiris Ioannidis
We analyze Iran-related discourse across two structurally different platforms: Telegram (7,567 messages from international news channels) and Reddit (23,909 posts and comments from Iran-focused and global communities). Using a single reproducible pipeline, we apply NMF topic modeling over TF--IDF features, VADER sentiment scoring, and a keyword-bundle escalation index capturing military, nuclear, and diplomatic narratives. To assess whether discourse dynamics track offline developments, we compare escalation time series with external protest and geopolitical event timelines using same-day and lagged correlation analysis. Same-day correlations are weak, but the strongest relationships occur at non-zero lags, consistent with anticipatory or reactive framing rather than instantaneous mirroring. Finally, using a separate real-time collection (February 2026), we observe synchronized increases in escalation-related narratives that coincide with documented geopolitical developments. Overall, the results show systematic cross-platform differences in narrative structure and tone, and provide quantitative evidence that online escalation signals can align with real-world developments with measurable temporal offsets.
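The same-day versus lagged comparison described above can be illustrated with a plain Pearson correlation computed at several offsets. This is a minimal sketch assuming two equally spaced daily series of the same length. The function names, the sign convention for `lag`, and the toy data are assumptions and do not reproduce the paper's escalation index or event timelines.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def lagged_correlation(discourse, events, lag):
    """Correlate discourse[t + lag] with events[t].
    Positive lag: discourse follows events (reactive).
    Negative lag: discourse precedes events (anticipatory)."""
    if lag > 0:
        x, y = discourse[lag:], events[:-lag]
    elif lag < 0:
        x, y = discourse[:lag], events[-lag:]
    else:
        x, y = discourse, events
    return pearson(x, y)

def best_lag(discourse, events, max_lag):
    """Lag in [-max_lag, max_lag] with the largest absolute correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda k: abs(lagged_correlation(discourse, events, k)))

# Toy daily counts: the discourse series echoes the event series two days later.
events = [0, 0, 1, 0, 0, 2, 0, 0, 1, 0]
discourse = [0, 0, 0, 0, 1, 0, 0, 2, 0, 0]
same_day = lagged_correlation(discourse, events, 0)
strongest = best_lag(discourse, events, max_lag=3)
```

In this toy case the same-day correlation is weak and slightly negative, while the relationship is strongest at a lag of two days, which mirrors the pattern the abstract describes. The series must be longer than `max_lag` for every offset to have data.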
CY Nov 27, 2025
Cross-Platform Digital Discourse Analysis of the Israel-Hamas Conflict: Sentiment, Topics, and Event Dynamics
Despoina Antonakaki, Sotiris Ioannidis
The Israeli-Palestinian conflict remains one of the most polarizing geopolitical issues, with the October 2023 escalation intensifying online debate. Social media platforms, particularly Telegram, have become central to real-time news sharing, advocacy, and propaganda. In this study, we analyze Telegram, Twitter/X, and Reddit to examine how conflict narratives are produced, amplified, and contested across different digital spheres. Building on our previous work on Telegram discourse during the 2023 escalation, we extend the analysis longitudinally and cross-platform using an updated dataset spanning October 2023 to mid-2025. The corpus includes more than 187,000 Telegram messages, 2.1 million Reddit comments, and curated Twitter/X posts. We combine Latent Dirichlet Allocation (LDA), BERTopic, and transformer-based sentiment and emotion models to identify dominant themes, emotional dynamics, and propaganda strategies. Telegram channels provide unfiltered, high-intensity documentation of events; Twitter/X amplifies frames to global audiences; and Reddit hosts more reflective and deliberative discussions. Our findings reveal persistent negative sentiment, strong coupling between humanitarian framing and solidarity expressions, and platform-specific pathways for the diffusion of pro-Palestinian and pro-Israeli narratives. This paper offers three contributions: (1) a multi-platform, FAIR-compliant dataset on the Israel-Hamas war, (2) an integrated pipeline combining topic modeling, sentiment and emotion analysis, and spam filtering for large-scale conflict discourse, and (3) empirical insights into how platform affordances and affective publics shape the evolution of digital conflict communication.
SI Jan 30, 2025
Israel-Hamas war through Telegram, Reddit and Twitter
Despoina Antonakaki, Sotiris Ioannidis
The Israeli-Palestinian conflict, which started on 7 October 2023, has so far resulted in over 48,000 people killed, including more than 17,000 children, with the majority from Gaza; more than 30,000 people injured; over 10,000 missing; and over 1 million people displaced, fleeing conflict zones. Infrastructure damage includes 87% of housing units, 80% of public buildings, 60% of cropland, 17 out of 36 hospitals, 68% of road networks, and 87% of school buildings. The conflict has also sparked online discussion across various social media platforms. Telegram is no exception, owing to its encrypted communication and highly engaged audience. This study analyzes the related discussion with respect to the different participants in the conflict and the sentiment expressed in those discussions. To this end, we prepared a dataset of 125K messages shared on Telegram channels, spanning from 23 October 2025 until today. Additionally, we apply the same analysis to two publicly available datasets: one from Twitter containing 2001 tweets and one from Reddit containing 2M opinions. We apply a volume analysis across the three datasets, followed by entity extraction, and then proceed to BERT topic analysis to extract common themes or topics. Next, we apply sentiment analysis to assess the emotional tone of the discussions. Our findings hint at polarized narratives as the hallmark of how political factions and outsiders mold public opinion. We also analyze the relationship between sentiment and topic prevalence, detailing trends that may indicate manipulation and propaganda attempts by the involved parties. This will give a better understanding of the online discourse on the Israel-Palestine conflict and contribute to the knowledge on the dynamics of social media communication during geopolitical crises.
SI May 31, 2023
BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline
Alexander Shevtsov, Despoina Antonakaki, Ioannis Lamprou et al.
Twitter, as one of the most popular social networks, provides a platform for communication and online discourse. Unfortunately, it has also become a target for bots and fake accounts, resulting in the spread of false information and manipulation. This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges associated with machine learning model development. Through this pipeline, we develop a comprehensive bot detection model named BotArtist, based on user profile features. SAMLP leverages nine distinct publicly available datasets to train the BotArtist model. To assess BotArtist's performance against current state-of-the-art solutions, we evaluate 35 existing Twitter bot detection methods, each utilizing a diverse range of features. Our comparative evaluation of BotArtist and these existing methods, conducted across nine public datasets under standardized conditions, reveals that the proposed model outperforms existing solutions by almost 10% in terms of F1-score, achieving an average score of 83.19% and 68.5% over specific and general approaches, respectively. As a result of this research, we provide one of the largest labeled Twitter bot datasets. The dataset contains extracted features combined with BotArtist predictions for 10,929,533 Twitter user profiles, collected via Twitter API during the 2022 Russo-Ukrainian War over a 16-month period. This dataset was created based on [Shevtsov et al., 2022a] where the original authors share anonymized tweets discussing the Russo-Ukrainian war, totaling 127,275,386 tweets. The combination of the existing textual dataset and the provided labeled bot and human profiles will enable future development of more advanced bot detection large language models in the post-Twitter API era.
SI Dec 8, 2021
Identification of Twitter Bots Based on an Explainable Machine Learning Framework: The US 2020 Elections Case Study
Alexander Shevtsov, Christos Tzagkarakis, Despoina Antonakaki et al.
Twitter is one of the most popular social networks, attracting millions of users and capturing a considerable proportion of online discourse. It provides a simple usage framework with short messages and an efficient application programming interface (API), enabling the research community to study and analyze several aspects of this social network. However, Twitter's simplicity of use also makes it open to abuse by various bots. This abuse spreads through online discourse, especially during electoral periods, when, apart from legitimate bots used for dissemination and communication purposes, bots are deployed to manipulate public opinion and the electorate towards a certain direction, specific ideology, or political party. This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data. To this end, a supervised machine learning (ML) framework is adopted using an Extreme Gradient Boosting (XGBoost) algorithm, where the hyper-parameters are tuned via cross-validation. Our study also deploys Shapley Additive Explanations (SHAP) to explain the ML model predictions by calculating feature importance using game-theoretic Shapley values. Experimental evaluation on distinct Twitter datasets demonstrates the superiority of our approach, in terms of bot detection accuracy, when compared against a recent state-of-the-art Twitter bot detection method.
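To make the game-theoretic idea behind Shapley values concrete, the sketch below computes exact Shapley values for a tiny cooperative game by averaging each player's marginal contribution over all orderings. It is not the SHAP library and not the paper's XGBoost pipeline. The feature names and the toy value function are hypothetical, and exact enumeration is only feasible for a handful of players.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all player orderings.
    `value` maps a frozenset of players to a number (the coalition's payoff)."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical toy "model": the score rises only when both features are present,
# and a third feature contributes nothing.
def toy_value(coalition):
    return 1.0 if {"followers", "account_age"} <= coalition else 0.0

phi = shapley_values(["followers", "account_age", "has_url"], toy_value)
```

The two interacting features split the payoff equally (0.5 each) and the irrelevant one receives 0, so the values sum to the payoff of the full coalition. SHAP applies the same attribution principle to model predictions, using approximations or model-specific algorithms rather than full enumeration.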