CLMay 3, 2022
A Holistic Framework for Analyzing the COVID-19 Vaccine DebateMaria Leonor Pacheco, Tunazzina Islam, Monal Mahajan et al.
The Covid-19 pandemic has led to infodemic of low quality information leading to poor health decisions. Combating the outcomes of this infodemic is not only a question of identifying false claims, but also reasoning about the decisions individuals make. In this work we propose a holistic analysis framework connecting stance and reason analysis, and fine-grained entity level moral sentiment analysis. We study how to model the dependencies between the different level of analysis and incorporate human insights into the learning process. Experiments show that our framework provides reliable predictions even in the low-supervision settings.
CLOct 18, 2022
Understanding COVID-19 Vaccine Campaign on Facebook using Minimal SupervisionTunazzina Islam, Dan Goldwasser
In the age of social media, where billions of internet users share information and opinions, the negative impact of pandemics is not limited to the physical world. It provokes a surge of incomplete, biased, and incorrect information, also known as an infodemic. This global infodemic jeopardizes measures to control the pandemic by creating panic, vaccine hesitancy, and fragmented social response. Platforms like Facebook allow advertisers to adapt their messaging to target different demographics and help alleviate or exacerbate the infodemic problem depending on their content. In this paper, we propose a minimally supervised multi-task learning framework for understanding messaging on Facebook related to the COVID vaccine by identifying ad themes and moral foundations. Furthermore, we perform a more nuanced thematic analysis of messaging tactics of vaccine campaigns on social media so that policymakers can make better decisions on pandemic control.
CLOct 19, 2022
Weakly Supervised Learning for Analyzing Political Campaigns on FacebookTunazzina Islam, Shamik Roy, Dan Goldwasser
Social media platforms are currently the main channel for political messaging, allowing politicians to target specific demographics and adapt based on their reactions. However, making this communication transparent is challenging, as the messaging is tightly coupled with its intended audience and often echoed by multiple stakeholders interested in advancing specific policies. Our goal in this paper is to take a first step towards understanding these highly decentralized settings. We propose a weakly supervised approach to identify the stance and issue of political ads on Facebook and analyze how political campaigns use some kind of demographic targeting by location, gender, or age. Furthermore, we analyze the temporal dynamics of the political ads on election polls.
CLMar 25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QASaahil Mathur, Ryan David Rittner, Vedant Ajit Thakur et al.
Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.
CLJan 23
Who Gets Which Message? Auditing Demographic Bias in LLM-Generated Targeted TextTunazzina Islam
Large language models (LLMs) are increasingly capable of generating personalized, persuasive text at scale, raising new questions about bias and fairness in automated communication. This paper presents the first systematic analysis of how LLMs behave when tasked with demographic-conditioned targeted messaging. We introduce a controlled evaluation framework using three leading models -- GPT-4o, Llama-3.3, and Mistral-Large 2.1 -- across two generation settings: Standalone Generation, which isolates intrinsic demographic effects, and Context-Rich Generation, which incorporates thematic and regional context to emulate realistic targeting. We evaluate generated messages along three dimensions: lexical content, language style, and persuasive framing. We instantiate this framework on climate communication and find consistent age- and gender-based asymmetries across models: male- and youth-targeted messages emphasize agency, innovation, and assertiveness, while female- and senior-targeted messages stress warmth, care, and tradition. Contextual prompts systematically amplify these disparities, with persuasion scores significantly higher for messages tailored to younger or male audiences. Our findings demonstrate how demographic stereotypes can surface and intensify in LLM-generated targeted communication, underscoring the need for bias-aware generation pipelines and transparent auditing frameworks that explicitly account for demographic conditioning in socially sensitive applications.
CLMar 15, 2024
Discovering Latent Themes in Social Media Messaging: A Machine-in-the-Loop Approach Integrating LLMsTunazzina Islam, Dan Goldwasser
Grasping the themes of social media content is key to understanding the narratives that influence public opinion and behavior. The thematic analysis goes beyond traditional topic-level analysis, which often captures only the broadest patterns, providing deeper insights into specific and actionable themes such as "public sentiment towards vaccination", "political discourse surrounding climate policies," etc. In this paper, we introduce a novel approach to uncovering latent themes in social media messaging. Recognizing the limitations of the traditional topic-level analysis, which tends to capture only overarching patterns, this study emphasizes the need for a finer-grained, theme-focused exploration. Traditional theme discovery methods typically involve manual processes and a human-in-the-loop approach. While valuable, these methods face challenges in scalability, consistency, and resource intensity in terms of time and cost. To address these challenges, we propose a machine-in-the-loop approach that leverages the advanced capabilities of Large Language Models (LLMs). To demonstrate our approach, we apply our framework to contentious topics, such as climate debate and vaccine debate. We use two publicly available datasets: (1) the climate campaigns dataset of 21k Facebook ads and (2) the COVID-19 vaccine campaigns dataset of 9k Facebook ads. Our quantitative and qualitative analysis shows that our methodology yields more accurate and interpretable results compared to the baselines. Our results not only demonstrate the effectiveness of our approach in uncovering latent themes but also illuminate how these themes are tailored for demographic targeting in social media contexts. Additionally, our work sheds light on the dynamic nature of social media, revealing the shifts in the thematic focus of messaging in response to real-world events.
CLApr 16, 2024
Uncovering Latent Arguments in Social Media Messaging by Employing LLMs-in-the-Loop StrategyTunazzina Islam, Dan Goldwasser
The widespread use of social media has led to a surge in popularity for automated methods of analyzing public opinion. Supervised methods are adept at text categorization, yet the dynamic nature of social media discussions poses a continual challenge for these techniques due to the constant shifting of the focus. On the other hand, traditional unsupervised methods for extracting themes from public discourse, such as topic modeling, often reveal overarching patterns that might not capture specific nuances. Consequently, a significant portion of research into social media discourse still depends on labor-intensive manual coding techniques and a human-in-the-loop approach, which are both time-consuming and costly. In this work, we study the problem of discovering arguments associated with a specific theme. We propose a generic LLMs-in-the-Loop strategy that leverages the advanced capabilities of Large Language Models (LLMs) to extract latent arguments from social media messaging. To demonstrate our approach, we apply our framework to contentious topics. We use two publicly available datasets: (1) the climate campaigns dataset of 14k Facebook ads with 25 themes and (2) the COVID-19 vaccine campaigns dataset of 9k Facebook ads with 14 themes. Additionally, we design a downstream task as stance prediction by leveraging talking points in climate debates. Furthermore, we analyze demographic targeting and the adaptation of messaging based on real-world events.
CLFeb 4, 2025
Can LLMs Assist Annotators in Identifying Morality Frames? -- Case Study on Vaccination Debate on Social MediaTunazzina Islam, Dan Goldwasser
Nowadays, social media is pivotal in shaping public discourse, especially on polarizing issues like vaccination, where diverse moral perspectives influence individual opinions. In NLP, data scarcity and complexity of psycholinguistic tasks, such as identifying morality frames, make relying solely on human annotators costly, time-consuming, and prone to inconsistency due to cognitive load. To address these issues, we leverage large language models (LLMs), which are adept at adapting new tasks through few-shot learning, utilizing a handful of in-context examples coupled with explanations that connect examples to task principles. Our research explores LLMs' potential to assist human annotators in identifying morality frames within vaccination debates on social media. We employ a two-step process: generating concepts and explanations with LLMs, followed by human evaluation using a "think-aloud" tool. Our study shows that integrating LLMs into the annotation process enhances accuracy, reduces task difficulty, lowers cognitive load, suggesting a promising avenue for human-AI collaboration in complex psycholinguistic tasks.
CLApr 8
Reasoning-Based Refinement of Unsupervised Text Clusters with LLMsTunazzina Islam
Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms.Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analyses under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analyses of large text collections without supervision.
CLJan 19
Paid Voices vs. Public Feeds: Interpretable Cross-Platform Theme Modeling of Climate DiscourseSamantha Sudhoff, Pranav Perumal, Zhaoqing Wu et al.
Climate discourse online plays a crucial role in shaping public understanding of climate change and influencing political and policy outcomes. However, climate communication unfolds across structurally distinct platforms with fundamentally different incentive structures: paid advertising ecosystems incentivize targeted, strategic persuasion, while public social media platforms host largely organic, user-driven discourse. Existing computational studies typically analyze these environments in isolation, limiting our ability to distinguish institutional messaging from public expression. In this work, we present a comparative analysis of climate discourse across paid advertisements on Meta (previously known as Facebook) and public posts on Bluesky from July 2024 to September 2025. We introduce an interpretable, end-to-end thematic discovery and assignment framework that clusters texts by semantic similarity and leverages large language models (LLMs) to generate concise, human-interpretable theme labels. We evaluate the quality of the induced themes against traditional topic modeling baselines using both human judgments and an LLM-based evaluator, and further validate their semantic coherence through downstream stance prediction and theme-guided retrieval tasks. Applying the resulting themes, we characterize systematic differences between paid climate messaging and public climate discourse and examine how thematic prevalence shifts around major political events. Our findings show that platform-level incentives are reflected in the thematic structure, stance alignment, and temporal responsiveness of climate narratives. While our empirical analysis focuses on climate communication, the proposed framework is designed to support comparative narrative analysis across heterogeneous communication environments.
CLOct 16, 2025
Latent Topic Synthesis: Leveraging LLMs for Electoral Ad AnalysisAlexander Brady, Tunazzina Islam
Social media platforms play a pivotal role in shaping political discourse, but analyzing their vast and rapidly evolving content remains a major challenge. We introduce an end-to-end framework for automatically generating an interpretable topic taxonomy from an unlabeled corpus. By combining unsupervised clustering with prompt-based labeling, our method leverages large language models (LLMs) to iteratively construct a taxonomy without requiring seed sets or domain expertise. We apply this framework to a large corpus of Meta (previously known as Facebook) political ads from the month ahead of the 2024 U.S. Presidential election. Our approach uncovers latent discourse structures, synthesizes semantically rich topic labels, and annotates topics with moral framing dimensions. We show quantitative and qualitative analyses to demonstrate the effectiveness of our framework. Our findings reveal that voting and immigration ads dominate overall spending and impressions, while abortion and election-integrity achieve disproportionate reach. Funding patterns are equally polarized: economic appeals are driven mainly by conservative PACs, abortion messaging splits between pro- and anti-rights coalitions, and crime-and-justice campaigns are fragmented across local committees. The framing of these appeals also diverges--abortion ads emphasize liberty/oppression rhetoric, while economic messaging blends care/harm, fairness/cheating, and liberty/oppression narratives. Topic salience further reveals strong correlations between moral foundations and issues. Demographic targeting also emerges. This work supports scalable, interpretable analysis of political messaging on social media, enabling researchers, policymakers, and the public to better understand emerging narratives, polarization dynamics, and the moral underpinnings of digital political communication.
CLMay 8, 2023
Interactive Concept Learning for Uncovering Latent Themes in Large Text CollectionsMaria Leonor Pacheco, Tunazzina Islam, Lyle Ungar et al.
Experts across diverse disciplines are often interested in making sense of large text collections. Traditionally, this challenge is approached either by noisy unsupervised techniques such as topic models, or by following a manual theme discovery process. In this paper, we expand the definition of a theme to account for more than just a word distribution, and include generalized concepts deemed relevant by domain experts. Then, we propose an interactive framework that receives and encodes expert feedback at different levels of abstraction. Our framework strikes a balance between automation and manual coding, allowing experts to maintain control of their study while reducing the manual effort required.
CLMay 6, 2023
Analysis of Climate Campaigns on Social Media using Bayesian Model AveragingTunazzina Islam, Ruqi Zhang, Dan Goldwasser
Climate change is the defining issue of our time, and we are at a defining moment. Various interest groups, social movement organizations, and individuals engage in collective action on this issue on social media. In addition, issue advocacy campaigns on social media often arise in response to ongoing societal concerns, especially those faced by energy industries. Our goal in this paper is to analyze how those industries, their advocacy group, and climate advocacy group use social media to influence the narrative on climate change. In this work, we propose a minimally supervised model soup [57] approach combined with messaging themes to identify the stances of climate ads on Facebook. Finally, we release our stance dataset, model, and set of themes related to climate campaigns for future work on opinion mining and the automatic detection of climate change stances.
CLAug 20, 2021
Twitter User Representation Using Weakly Supervised Graph EmbeddingTunazzina Islam, Dan Goldwasser
Social media platforms provide convenient means for users to participate in multiple online activities on various contents and create fast widespread interactions. However, this rapidly growing access has also increased the diverse information, and characterizing user types to understand people's lifestyle decisions shared in social media is challenging. In this paper, we propose a weakly supervised graph embedding based framework for understanding user types. We evaluate the user embedding learned using weak supervision over well-being related tweets from Twitter, focusing on 'Yoga', 'Keto diet'. Experiments on real-world datasets demonstrate that the proposed framework outperforms the baselines for detecting user types. Finally, we illustrate data analysis on different types of users (e.g., practitioner vs. promotional) from our dataset. While we focus on lifestyle-related tweets (i.e., yoga, keto), our method for constructing user representation readily generalizes to other domains.
CLApr 7, 2021
Analysis of Twitter Users' Lifestyle Choices using Joint Embedding ModelTunazzina Islam, Dan Goldwasser
Multiview representation learning of data can help construct coherent and contextualized users' representations on social media. This paper suggests a joint embedding model, incorporating users' social and textual information to learn contextualized user representations used for understanding their lifestyle choices. We apply our model to tweets related to two lifestyle activities, `Yoga' and `Keto diet' and use it to analyze users' activity type and motivation. We explain the data collection and annotation process in detail and provide an in-depth analysis of users from different classes based on their Twitter content. Our experiments show that our model results in performance improvements in both domains.
CLDec 17, 2020
Do You Do Yoga? Understanding Twitter Users' Types and Motivations using Social and Textual InformationTunazzina Islam, Dan Goldwasser
Leveraging social media data to understand people's lifestyle choices is an exciting domain to explore but requires a multiview formulation of the data. In this paper, we propose a joint embedding model based on the fusion of neural networks with attention mechanism by incorporating social and textual information of users to understand their activities and motivations. We use well-being related tweets from Twitter, focusing on 'Yoga'. We demonstrate our model on two downstream tasks: (i) finding user type such as either practitioner or promotional (promoting yoga studio/gym), other; (ii) finding user motivation i.e. health benefit, spirituality, love to tweet/retweet about yoga but do not practice yoga.
CLDec 5, 2020
Does Yoga Make You Happy? Analyzing Twitter User Happiness using Textual and Temporal InformationTunazzina Islam, Dan Goldwasser
Although yoga is a multi-component practice to hone the body and mind and be known to reduce anxiety and depression, there is still a gap in understanding people's emotional state related to yoga in social media. In this study, we investigate the causal relationship between practicing yoga and being happy by incorporating textual and temporal information of users using Granger causality. To find out causal features from the text, we measure two variables (i) Yoga activity level based on content analysis and (ii) Happiness level based on emotional state. To understand users' yoga activity, we propose a joint embedding model based on the fusion of neural networks with attention mechanism by leveraging users' social and textual information. For measuring the emotional state of yoga users (target domain), we suggest a transfer learning approach to transfer knowledge from an attention-based neural network model trained on a source domain. Our experiment on Twitter dataset demonstrates that there are 1447 users where "yoga Granger-causes happiness".
CLJun 15, 2019
Yoga-Veganism: Correlation Mining of Twitter Health DataTunazzina Islam
Nowadays social media is a huge platform of data. People usually share their interest, thoughts via discussions, tweets, status. It is not possible to go through all the data manually. We need to mine the data to explore hidden patterns or unknown correlations, find out the dominant topic in data and understand people's interest through the discussions. In this work, we explore Twitter data related to health. We extract the popular topics under different categories (e.g. diet, exercise) discussed in Twitter via topic modeling, observe model behavior on new tweets, discover interesting correlation (i.e. Yoga-Veganism). We evaluate accuracy by comparing with ground truth using manual annotation both for train and test data.
CLMay 24, 2019
Ex-Twit: Explainable Twitter Mining on Health DataTunazzina Islam
Since most machine learning models provide no explanations for the predictions, their predictions are obscure for the human. The ability to explain a model's prediction has become a necessity in many applications including Twitter mining. In this work, we propose a method called Explainable Twitter Mining (Ex-Twit) combining Topic Modeling and Local Interpretable Model-agnostic Explanation (LIME) to predict the topic and explain the model predictions. We demonstrate the effectiveness of Ex-Twit on Twitter health-related data.