SOC-PH May 10, 2022
On learning agent-based models from data
Corrado Monti, Marco Pangallo, Gianmarco De Francisci Morales et al.
Agent-Based Models (ABMs) are used in several fields to study the evolution of complex systems from micro-level assumptions. However, ABMs typically cannot estimate agent-specific (or "micro") variables: this is a major limitation which prevents ABMs from harnessing micro-level data availability and which greatly limits their predictive power. In this paper, we propose a protocol to learn the latent micro-variables of an ABM from data. The first step of our protocol is to reduce an ABM to a probabilistic model, characterized by a computationally tractable likelihood. This reduction follows two general design principles: balance of stochasticity and data availability, and replacement of unobservable discrete choices with differentiable approximations. Then, our protocol proceeds by maximizing the likelihood of the latent variables via a gradient-based expectation maximization algorithm. We demonstrate our protocol by applying it to an ABM of the housing market, in which agents with different incomes bid higher prices to live in high-income neighborhoods. We demonstrate that the obtained model allows accurate estimates of the latent variables, while preserving the general behavior of the ABM. We also show that our estimates can be used for out-of-sample forecasting. Our protocol can be seen as an alternative to black-box data assimilation methods, one that forces the modeler to lay bare the assumptions of the model, to think about the inferential process, and to spot potential identification problems.
LG Aug 31, 2022
Learning Multiscale Non-stationary Causal Structures
Gabriele D'Acunto, Gianmarco De Francisci Morales, Paolo Bajardi et al.
This paper addresses a gap in the current state of the art by providing a solution for modeling causal relationships that evolve over time and occur at different time scales. Specifically, we introduce the multiscale non-stationary directed acyclic graph (MN-DAG), a framework for modeling multivariate time series data. Our contribution is twofold. Firstly, we expose a probabilistic generative model by leveraging results from spectral and causality theories. Our model allows sampling an MN-DAG according to user-specified priors on the time-dependence and multiscale properties of the causal graph. Secondly, we devise a Bayesian method named Multiscale Non-stationary Causal Structure Learner (MN-CASTLE) that uses stochastic variational inference to estimate MN-DAGs. The method also exploits information from the local partial correlation between time series over different time resolutions. The data generated from an MN-DAG reproduces well-known features of time series in different domains, such as volatility clustering and serial correlation. Additionally, we show the superior performance of MN-CASTLE on synthetic data with different multiscale and non-stationary properties compared to baseline models. Finally, we apply MN-CASTLE to identify the drivers of the natural gas prices in the US market. Causal relationships have strengthened during the COVID-19 outbreak and the Russian invasion of Ukraine, a fact that baseline methods fail to capture. MN-CASTLE identifies the causal impact of critical economic drivers on natural gas prices, such as seasonal factors, economic uncertainty, oil prices, and gas storage deviations.
LG Oct 31, 2023
Extracting the Multiscale Causal Backbone of Brain Dynamics
Gabriele D'Acunto, Francesco Bonchi, Gianmarco De Francisci Morales et al.
The bulk of the research effort on brain connectivity revolves around statistical associations among brain regions, which do not directly relate to the causal mechanisms governing brain dynamics. Here we propose the multiscale causal backbone (MCB) of brain dynamics, shared by a set of individuals across multiple temporal scales, and devise a principled methodology to extract it. Our approach leverages recent advances in multiscale causal structure learning and optimizes the trade-off between the model fit and its complexity. Empirical assessment on synthetic data shows the superiority of our methodology over a baseline based on canonical functional connectivity networks. When applied to resting-state fMRI data, we find sparse MCBs for both the left and right brain hemispheres. Thanks to its multiscale nature, our approach shows that at low-frequency bands, causal dynamics are driven by brain regions associated with high-level cognitive functions; at higher frequencies instead, nodes related to sensory processing play a crucial role. Finally, our analysis of individual multiscale causal structures confirms the existence of a causal fingerprint of brain connectivity, thus supporting the existing extensive research in brain connectivity fingerprinting from a causal perspective.
19.6 CL May 25
P1SCO: Social Dimensions from a Perspectivist Lens
Amanda Cercas Curry, Gianmarco de Francisci Morales, Luca Maria Aiello
We introduce P1SCO, a dataset of social media comments collected from three distinct platforms, annotated according to ten social dimensions to capture the diversity of social interactions and perceptions. The dataset is carefully disaggregated to allow analysis at the level of individual comments, annotators, and platforms. In addition to the social dimension labels, we include rich metadata on the annotators, including demographics, Big Five personality profiles, and political affiliation. This combination of comment-level annotations and annotator-level features enables nuanced analyses of how social perception varies across platforms, individual differences, and demographic factors. By preserving the diversity of annotator perspectives, our dataset supports studies of inter- and intra-annotator agreement, the influence of personality and political orientation on social interpretation, and the cross-platform dynamics of social discourse.
72.7 SI Apr 21
Among Us: Language of Conspiracy Theorists on Mainstream Reddit
Francesco Corso, Giuseppe Russo, Francesco Pierri et al.
The interaction between fringe subcultures and mainstream online communities poses significant challenges for understanding discourse on social media. In this work, we investigate whether users active in conspiracy-focused communities exhibit detectable linguistic signatures when participating in general-interest spaces, such as news, humor, or hobbyist forums. We analyze a large-scale longitudinal dataset of over 500 million comments spanning 10 years of Reddit activity, examining the communication patterns of these users across diverse social contexts independent of the topics they discuss. We show that these users exhibit distinctive linguistic patterns that enable machine learning models to reliably distinguish them from the general population within individual communities (averaging 87% accuracy across more than 20 binary classification tasks). Crucially, no single aggregate model captures these patterns across communities, as community-specific models outperform global classifiers by up to 17 percentage points. This result suggests that while these users are distinct, their linguistic expression is dynamic and highly responsive to the social norms of the environment they inhabit. Our findings suggest the need for tailored interventions in online spaces, as linguistic signals associated with conspiracy and fringe subcultures vary across communities and cannot be effectively addressed by uniform detection or moderation strategies.
16.1 CY Apr 3
Effects of Algorithmic Visibility on Conspiracy Communities: Reddit after Epstein's 'Suicide'
Asja Attanasio, Francesco Corso, Gianmarco De Francisci Morales et al.
Following the death of Jeffrey Epstein, the subreddit r/conspiracy experienced a significant visibility shock that brought mainstream users into direct contact with established conspiracy narratives. In this work, we explore how large-scale surges in public attention reshape participation and discourse within online conspiracy communities. We ask whether a sudden increase in exposure changes who joins r/conspiracy, how long they stay, and how they adapt linguistically, compared with users who arrive through organic discovery. Using a computational framework that combines toxicity scores, survival analysis, and lexical and semantic measures over a period of 12 months, we observe that mainstream visibility is associated with patterns consistent with a selection mechanism rather than a simple amplifier. Users who join the conspiracy community during the arrest period tend to show higher linguistic similarity to core users, especially regarding linguistic and thematic norms, and show more stable engagement over time. By contrast, users who arrive during the height of public visibility remain semantically distant from core discourse and participate more briefly. Overall, we find that mainstream visibility is connected with changes in audience size, community composition, and linguistic cohesion. However, incidental exposure during attention shocks does not typically produce durable, integrated community members. These results provide a more nuanced understanding of how external events and platform visibility influence the growth and evolution of conspiracy spaces, offering insights for the design of responsible and transparent recommendation systems.
CL Nov 5, 2025
Do Androids Dream of Unseen Puppeteers? Probing for a Conspiracy Mindset in Large Language Models
Francesco Corso, Francesco Pierri, Gianmarco De Francisci Morales
In this paper, we investigate whether Large Language Models (LLMs) exhibit conspiratorial tendencies, whether they display sociodemographic biases in this domain, and how easily they can be conditioned into adopting conspiratorial perspectives. Conspiracy beliefs play a central role in the spread of misinformation and in shaping distrust toward institutions, making them a critical testbed for evaluating the social fidelity of LLMs. LLMs are increasingly used as proxies for studying human behavior, yet little is known about whether they reproduce higher-order psychological constructs such as a conspiratorial mindset. To bridge this research gap, we administer validated psychometric surveys measuring conspiracy mindset to multiple models under different prompting and conditioning strategies. Our findings reveal that LLMs show partial agreement with elements of conspiracy belief, and conditioning with socio-demographic attributes produces uneven effects, exposing latent demographic biases. Moreover, targeted prompts can easily shift model responses toward conspiratorial directions, underscoring both the susceptibility of LLMs to manipulation and the potential risks of their deployment in sensitive contexts. These results highlight the importance of critically evaluating the psychological dimensions embedded in LLMs, both to advance computational social science and to inform possible mitigation strategies against harmful uses.
CY Mar 8, 2024
Variational Inference of Parameters in Opinion Dynamics Models
Jacopo Lenti, Fabrizio Silvestri, Gianmarco De Francisci Morales
Despite the frequent use of agent-based models (ABMs) for studying social phenomena, parameter estimation remains a challenge, often relying on costly simulation-based heuristics. This work uses variational inference to estimate the parameters of an opinion dynamics ABM, by transforming the estimation problem into an optimization task that can be solved directly. Our proposal relies on probabilistic generative ABMs (PGABMs): we start by synthesizing a probabilistic generative model from the ABM rules. Then, we transform the inference process into an optimization problem suitable for automatic differentiation. In particular, we use the Gumbel-Softmax reparameterization for categorical agent attributes and stochastic variational inference for parameter estimation. Furthermore, we explore the trade-offs of using variational distributions with different complexity: normal distributions and normalizing flows. We validate our method on a bounded confidence model with agent roles (leaders and followers). Our approach estimates both macroscopic parameters (bounded confidence intervals and backfire thresholds) and microscopic parameters (200 categorical, agent-level roles) more accurately than simulation-based and MCMC methods. Consequently, our technique enables experts to tune and validate their ABMs against real-world observations, thus providing insights into human behavior in social systems via data-driven analysis.
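To make the Gumbel-Softmax reparameterization mentioned in this abstract concrete, here is a minimal standard-library sketch of how a categorical agent attribute can be replaced by a continuous, nearly one-hot sample. It is not the authors' implementation: the function name `gumbel_softmax`, the two-role logits, and the temperature value are illustrative assumptions.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (continuous) one-hot sample from categorical logits."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)).
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    shift = max(noisy)
    exps = [math.exp(x - shift) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical agent with two roles (e.g. leader, follower) and assumed probabilities.
role_logits = [math.log(0.3), math.log(0.7)]
sample = gumbel_softmax(role_logits, temperature=0.5, rng=rng)
print(sample)  # a probability vector that approaches one-hot as temperature -> 0
```

In an automatic-differentiation framework the same expression is differentiable with respect to the logits, which is what makes gradient-based inference over discrete attributes possible. Lower temperatures give samples closer to one-hot, at the cost of noisier gradients.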
SI Jun 11, 2025
Alice and the Caterpillar: A more descriptive null model for assessing data mining results
Giulia Preti, Gianmarco De Francisci Morales, Matteo Riondato
We introduce novel null models for assessing the results obtained from observed binary transactional and sequence datasets, using statistical hypothesis testing. Our null models maintain more properties of the observed dataset than existing ones. Specifically, they preserve the Bipartite Joint Degree Matrix of the bipartite (multi-)graph corresponding to the dataset, which ensures that the number of caterpillars, i.e., paths of length three, is preserved, in addition to other properties considered by other models. We describe Alice, a suite of Markov chain Monte Carlo algorithms for sampling datasets from our null models, based on a carefully defined set of states and efficient operations to move between them. The results of our experimental evaluation show that Alice mixes fast and scales well, and that our null model finds different significant results than ones previously considered in the literature.
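The link between the Bipartite Joint Degree Matrix (BJDM) and the number of caterpillars can be illustrated with a small sketch. It assumes a simple bipartite graph (no repeated edges) given as `(transaction, item)` pairs. The toy dataset and function names are hypothetical, and this is not the Alice sampler itself. In such a graph, every edge whose endpoints have degrees `i` and `j` is the middle edge of `(i - 1) * (j - 1)` paths of length three, so the BJDM alone determines the count.

```python
from collections import Counter


def bjdm(edges):
    """Joint degree matrix as a Counter keyed by (transaction_degree, item_degree)."""
    left_deg = Counter(u for u, _ in edges)
    right_deg = Counter(v for _, v in edges)
    return Counter((left_deg[u], right_deg[v]) for u, v in edges)


def caterpillars_from_bjdm(matrix):
    """Count paths of length three using only the joint degree matrix."""
    return sum(count * (i - 1) * (j - 1) for (i, j), count in matrix.items())


def caterpillars_brute_force(edges):
    """Count paths item' - u - v - transaction' directly, with (u, v) as middle edge."""
    total = 0
    for u, v in edges:
        other_items = sum(1 for u2, v2 in edges if u2 == u and v2 != v)
        other_transactions = sum(1 for u3, v3 in edges if v3 == v and u3 != u)
        total += other_items * other_transactions
    return total


# Hypothetical toy dataset: three transactions over items a, b, c.
edges = [
    ("t1", "a"), ("t1", "b"),
    ("t2", "b"), ("t2", "c"),
    ("t3", "a"), ("t3", "b"), ("t3", "c"),
]
matrix = bjdm(edges)
print(caterpillars_from_bjdm(matrix), caterpillars_brute_force(edges))
```

Both counts agree on this example, which is the sense in which a null model that preserves the BJDM also preserves the number of caterpillars.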
40.9 CY Apr 7
Conditional Publics: Shared Events and Divergent Meanings in the European Twitter Debate on the Ukraine War
Corrado Monti, Arthur Capozzi, Yelena Mejova et al.
How do European publics debate a geopolitical crisis on social media, and do they inhabit a shared informational reality? We analyze over 38 million geolocated tweets from 20 European countries during the first eight months of the Russian invasion of Ukraine. Using retweet community detection and stance annotation across six issues, we identify 'hawkish' and 'doveish' opinion clusters present within almost every country studied. We find that structural polarization is driven not by radicalization, but by the exit of casual users. Crucially, whether opposing sides orient to the same events depends on the issue. On pragmatist issues, both sides react to the same high-profile events, forming an agonistic public sphere. Instead, on interpretive issues, they operate as affective publics and counterpublics constructing divergent meanings. We propose conditional publics to describe formations whose relational structure, sharing or fracturing a referential frame, depends on the epistemic character of the debated issue.
CL Jan 25
Beyond the Rabbit Hole: Mapping the Relational Harms of QAnon Radicalization
Bich Ngoc Doan, Giuseppe Russo et al.
The rise of conspiracy theories has created far-reaching societal harm in the public discourse by eroding trust and fueling polarization. Beyond this public impact lies a deeply personal toll on the friends and families of conspiracy believers, a dimension often overlooked in large-scale computational research. This study fills this gap by systematically mapping radicalization journeys and quantifying the associated emotional toll inflicted on loved ones. We use the prominent case of QAnon as a case study, analyzing 12747 narratives from the r/QAnonCasualties support community through a novel mixed-methods approach. First, we use topic modeling (BERTopic) to map the radicalization trajectories, identifying key pre-existing conditions, triggers, and post-radicalization characteristics. From this, we apply an LDA-based graphical model to uncover six recurring archetypes of QAnon adherents, which we term "radicalization personas." Finally, using LLM-assisted emotion detection and regression modeling, we link these personas to the specific emotional toll reported by narrators. Our findings reveal that these personas are not just descriptive; they are powerful predictors of the specific emotional harms experienced by narrators. Radicalization perceived as a deliberate ideological choice is associated with narrator anger and disgust, while those marked by personal and cognitive collapse are linked to fear and sadness. This work provides the first empirical framework for understanding radicalization as a relational phenomenon, offering a vital roadmap for researchers and practitioners to navigate its interpersonal fallout.
LG Sep 22, 2025
Comparing Data Assimilation and Likelihood-Based Inference on Latent State Estimation in Agent-Based Models
Blas Kolic, Corrado Monti, Gianmarco De Francisci Morales et al.
In this paper, we present the first systematic comparison of Data Assimilation (DA) and Likelihood-Based Inference (LBI) in the context of Agent-Based Models (ABMs). These models generate observable time series driven by evolving, partially-latent microstates. Latent states need to be estimated to align simulations with real-world data, a task traditionally addressed by DA, especially in continuous and equation-based models such as those used in weather forecasting. However, the nature of ABMs poses challenges for standard DA methods. Solving such issues requires adaptation of previous DA techniques, or ad-hoc alternatives such as LBI. DA approximates the likelihood in a model-agnostic way, making it broadly applicable but potentially less precise. In contrast, LBI provides more accurate state estimation by directly leveraging the model's likelihood, but at the cost of requiring a hand-crafted, model-specific likelihood function, which may be complex or infeasible to derive. We compare the two methods on the Bounded-Confidence Model, a well-known opinion dynamics ABM, where agents are affected only by others holding sufficiently similar opinions. We find that LBI better recovers latent agent-level opinions, even under model mis-specification, leading to improved individual-level forecasts. At the aggregate level, however, both methods perform comparably, and DA remains competitive across levels of aggregation under certain parameter settings. Our findings suggest that DA is well-suited for aggregate predictions, while LBI is preferable for agent-level inference.
ME Jun 13, 2025
Bias and Identifiability in the Bounded Confidence Model
Claudio Borile, Jacopo Lenti, Valentina Ghidini et al.
Opinion dynamics models such as the bounded confidence models (BCMs) describe how a population can reach consensus, fragmentation, or polarization, depending on a few parameters. Connecting such models to real-world data could help in understanding such phenomena and in testing model assumptions. To this end, estimation of model parameters is a key aspect, and maximum likelihood estimation provides a principled way to tackle it. Here, our goal is to outline the properties of statistical estimators of the two key BCM parameters: the confidence bound and the convergence rate. We find that their maximum likelihood estimators present different characteristics: the one for the confidence bound presents a small-sample bias but is consistent, while the estimator of the convergence rate shows a persistent bias. Moreover, the joint parameter estimation is affected by identifiability issues for specific regions of the parameter space, as several local maxima are present in the likelihood function. Our results show how the analysis of the likelihood function is a fruitful approach for better understanding the pitfalls and possibilities of estimating the parameters of opinion dynamics models, and more generally, agent-based models, and for offering formal guarantees for their calibration.
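The roles of the two parameters named in this abstract can be sketched with a short simulation. The sketch assumes the symmetric pairwise (Deffuant-Weisbuch style) update, which may differ from the exact variant studied in the paper. The names `bcm_step`, `epsilon` (confidence bound), and `mu` (convergence rate), and all numeric values, are illustrative assumptions.

```python
import random


def bcm_step(opinions, epsilon, mu, rng):
    """One pairwise encounter; returns True if the two agents influenced each other."""
    i, j = rng.sample(range(len(opinions)), 2)
    diff = opinions[j] - opinions[i]
    # Agents interact only if their opinions are closer than the confidence bound.
    if abs(diff) >= epsilon:
        return False
    # Both agents move toward each other by a fraction mu of their distance.
    opinions[i] += mu * diff
    opinions[j] -= mu * diff
    return True


rng = random.Random(42)
opinions = [rng.random() for _ in range(100)]  # initial opinions in [0, 1)
initial_mean = sum(opinions) / len(opinions)

interactions = sum(
    bcm_step(opinions, epsilon=0.2, mu=0.3, rng=rng) for _ in range(20000)
)
print(interactions, min(opinions), max(opinions))
```

Under this update rule only encounters that lead to an interaction move the opinions, so an observed trace carries information about `epsilon` through which pairs interact and about `mu` through how far they move. This is the kind of structure a likelihood function for the model has to encode.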
SI Jun 2, 2020
Learning Opinion Dynamics From Social Traces
Corrado Monti, Gianmarco De Francisci Morales, Francesco Bonchi
Opinion dynamics, the research field dealing with how people's opinions form and evolve in a social context, traditionally uses agent-based models to validate the implications of sociological theories. These models encode the causal mechanism that drives the opinion formation process, and have the advantage of being easy to interpret. However, as they do not exploit the availability of data, their predictive power is limited. Moreover, parameter calibration and model selection are manual and difficult tasks. In this work we propose an inference mechanism for fitting a generative, agent-like model of opinion dynamics to real-world social traces. Given a set of observables (e.g., actions and interactions between agents), our model can recover the most-likely latent opinion trajectories that are compatible with the assumptions about the process dynamics. This type of model retains the benefits of agent-based ones (i.e., causal interpretation), while adding the ability to perform model selection and hypothesis testing on real data. We showcase our proposal by translating a classical agent-based model of opinion dynamics into its generative counterpart. We then design an inference algorithm based on online expectation maximization to learn the latent parameters of the model. This algorithm can recover the latent opinion trajectories from traces generated by the classical agent-based model. In addition, it can identify the most likely set of macro parameters used to generate a data trace, thus allowing testing of sociological hypotheses. Finally, we apply our model to real-world data from Reddit to explore the long-standing question about the impact of the backfire effect. Our results suggest a low prominence of the effect in Reddit's political conversation.
CL Oct 4, 2019
Predicting the Role of Political Trolls in Social Media
Atanas Atanasov, Gianmarco De Francisci Morales, Preslav Nakov
We investigate the political roles of "Internet trolls" in social media. Political trolls, such as the ones linked to the Russian Internet Research Agency (IRA), have recently gained enormous attention for their ability to sway public opinion and even influence elections. Analysis of the online traces of trolls has shown different behavioral patterns, which target different slices of the population. However, this analysis is manual and labor-intensive, thus making it impractical as a first-response tool for newly-discovered troll farms. In this paper, we show how to automate this analysis by using machine learning in a realistic setting. In particular, we show how to classify trolls according to their political role (left, news feed, right) by using features extracted from social media, i.e., Twitter, in two scenarios: (i) in a traditional supervised learning scenario, where labels for trolls are available, and (ii) in a distant supervision scenario, where labels for trolls are not available, and we rely on more-commonly-available labels for news outlets mentioned by the trolls. Technically, we leverage the community structure and the text of the messages in the online social network of trolls represented as a graph, from which we extract several types of learned representations, i.e., embeddings, for the trolls. Experiments on the "IRA Russian Troll" dataset show that our methodology improves over the state-of-the-art in the first scenario, while providing a compelling case for the second scenario, which has not been explored in the literature thus far.
SI Feb 8, 2019
Link Prediction via Higher-Order Motif Features
Ghadeer Abuoda, Gianmarco De Francisci Morales, Ashraf Aboulnaga
Link prediction requires predicting which new links are likely to appear in a graph. Being able to predict unseen links with good accuracy has important applications in several domains such as social media, security, transportation, and recommendation systems. A common approach is to use features based on the common neighbors of an unconnected pair of nodes to predict whether the pair will form a link in the future. In this paper, we present an approach for link prediction that relies on higher-order analysis of the graph topology, well beyond common neighbors. We treat the link prediction problem as a supervised classification problem, and we propose a set of features that depend on the patterns or motifs that a pair of nodes occurs in. By using motifs of sizes 3, 4, and 5, our approach captures a high level of detail about the graph topology within the neighborhood of the pair of nodes, which leads to a higher classification accuracy. In addition to proposing the use of motif-based features, we also propose two optimizations related to constructing the classification dataset from the graph. First, to ensure that positive and negative examples are treated equally when extracting features, we propose adding the negative examples to the graph as an alternative to the common approach of removing the positive ones. Second, we show that it is important to control for the shortest-path distance when sampling pairs of nodes to form negative examples, since the difficulty of prediction varies with the shortest-path distance. We experimentally demonstrate that using off-the-shelf classifiers with a well constructed classification dataset results in up to 10 percentage points increase in accuracy over prior topology-based and feature learning methods.
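As a rough illustration of the difference between common-neighbor features and higher-order ones, the sketch below computes features for one candidate pair. These are not the paper's motif features: the function `pair_features`, the toy graph, and the choice of a length-3 path count as a stand-in for a 4-node pattern are all assumptions made for illustration.

```python
def pair_features(adjacency, u, v):
    """Topological features for an unconnected pair (u, v) in an undirected graph."""
    nu, nv = adjacency[u], adjacency[v]
    common = nu & nv
    union = nu | nv
    # Paths u - a - b - v with distinct intermediate nodes: a 4-node pattern
    # that looks one hop beyond the common neighbors.
    length3_paths = sum(
        1
        for a in nu
        for b in nv
        if a != b and a != v and b != u and b in adjacency[a]
    )
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "length3_paths": length3_paths,
    }


# Hypothetical toy graph as an undirected edge list.
edge_list = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)]
adjacency = {}
for x, y in edge_list:
    adjacency.setdefault(x, set()).add(y)
    adjacency.setdefault(y, set()).add(x)

features = pair_features(adjacency, 1, 4)
print(features)
```

Feature dictionaries like this one, computed for positive and negative pairs, would then be fed to an off-the-shelf classifier, in line with the supervised setup the abstract describes.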
DC Jul 28, 2016
VHT: Vertical Hoeffding Tree
Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet et al.
IoT Big Data requires new machine learning methods able to scale to large volumes of data arriving at high speed. Decision trees are popular machine learning models since they are very effective, yet easy to interpret and visualize. In the literature, we can find distributed algorithms for learning decision trees, and also streaming algorithms, but not algorithms that combine both features. In this paper we present the Vertical Hoeffding Tree (VHT), the first distributed streaming algorithm for learning decision trees. It features a novel way of distributing decision trees via vertical parallelism. The algorithm is implemented on top of Apache SAMOA, a platform for mining distributed data streams, and thus able to run on real-world clusters. We run several experiments to study the accuracy and throughput performance of our new VHT algorithm, as well as its ability to scale while keeping its superior performance with respect to non-distributed decision trees.
CL Jul 18, 2013
Says who? Automatic Text-Based Content Analysis of Television News
Carlos Castillo, Gianmarco De Francisci Morales, Marcelo Mendoza et al.
We perform an automatic analysis of television news programs, based on the closed captions that accompany them. Specifically, we collect all the news broadcast on over 140 television channels in the US during a period of six months. We start by segmenting, processing, and annotating the closed captions automatically. Next, we focus on the analysis of their linguistic style and on mentions of people using NLP methods. We present a series of key insights about news providers, people in the news, and we discuss the biases that can be uncovered by automatic means. These insights are contrasted by looking at the data from multiple points of view, including qualitative assessment.