Jeremy Blackburn

CY
h-index46
21papers
1,525citations
Novelty39%
AI Score44

21 Papers

CYSep 7, 2022
Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots

Wai Man Si, Michael Backes, Jeremy Blackburn et al.

Chatbots are used in many applications, e.g., automated agents, smart home assistants, interactive characters in online games, etc. Therefore, it is crucial to ensure they do not behave in undesired manners, providing offensive or toxic responses to users. This is not a trivial task as state-of-the-art chatbot models are trained on large, public datasets openly collected from the Internet. This paper presents a first-of-its-kind, large-scale measurement of toxicity in chatbots. We show that publicly available chatbots are prone to providing toxic responses when fed toxic queries. Even more worryingly, some non-toxic queries can trigger toxic responses too. We then set out to design and experiment with an attack, ToxicBuddy, which relies on fine-tuning GPT-2 to generate non-toxic queries that make chatbots respond in a toxic manner. Our extensive experimental evaluation demonstrates that our attack is effective against public chatbot models and outperforms manually-crafted malicious queries proposed by previous work. We also evaluate three defense mechanisms against ToxicBuddy, showing that they either reduce the attack performance at the cost of affecting the chatbot's utility or are only effective at mitigating a portion of the attack. This highlights the need for more research from the computer security and online safety communities to ensure that chatbot models do not hurt their users. Overall, we are confident that ToxicBuddy can be used as an auditing tool and that our work will pave the way toward designing more effective defenses for chatbot safety.

CVJul 30, 2024
PIXELMOD: Improving Soft Moderation of Visual Misleading Information on Twitter

Pujan Paudel, Chen Ling, Jeremy Blackburn et al.

Images are a powerful and immediate vehicle to carry misleading or outright false messages, yet identifying image-based misinformation at scale poses unique challenges. In this paper, we present PIXELMOD, a system that leverages perceptual hashes, vector databases, and optical character recognition (OCR) to efficiently identify images that are candidates to receive soft moderation labels on Twitter. We show that PIXELMOD outperforms existing image similarity approaches when applied to soft moderation, with negligible performance overhead. We then test PIXELMOD on a dataset of tweets surrounding the 2020 US Presidential Election, and find that it is able to identify visually misleading images that are candidates for soft moderation with 0.99% false detection and 2.06% false negatives.

ROSep 19, 2024
Vision Language Models Can Parse Floor Plan Maps

David DeFazio, Hrudayangam Mehta, Meng Wang et al.

Vision language models (VLMs) can simultaneously reason about images and texts to tackle many tasks, from visual question answering to image captioning. This paper focuses on map parsing, a novel task that is unexplored within the VLM context and particularly useful to mobile robots. Map parsing requires understanding not only the labels but also the geometric configurations of a map, i.e., what areas are like and how they are connected. To evaluate the performance of VLMs on map parsing, we prompt VLMs with floor plan maps to generate task plans for complex indoor navigation. Our results demonstrate the remarkable capability of VLMs in map parsing, with a success rate of 0.96 in tasks requiring a sequence of nine navigation actions, e.g., approaching and going through doors. Other than intuitive observations, e.g., VLMs do better in smaller maps and simpler navigation tasks, there was a very interesting observation that its performance drops in large open areas. We provide practical suggestions to address such challenges as validated by our experimental results. Webpage: https://sites.google.com/view/vlm-floorplan/

CLSep 22, 2025Code
Evaluating Large Language Models for Detecting Antisemitism

Jay Patel, Hrudayangam Mehta, Jeremy Blackburn

Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging in-context definition. We also study how LLMs understand and explain their decisions given a moderation policy as a guideline. First, we explore various prompting techniques and design a new CoT-like prompt, Guided-CoT, and find that injecting domain-specific thoughts increases performance and utility. Guided-CoT handles the in-context policy well, improving performance and utility by reducing refusals across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability. Code and resources available at: https://github.com/idramalab/quantify-llm-explanations

AIMay 17, 2025
On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study

Shuai Yang, Qi Yang, Luoxi Tang et al.

Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models (LLMs). By generating and analyzing counterfactual scenarios, researchers can assess the adaptability and reliability of model decision-making. Although prior work has shown that LLMs often struggle with counterfactual reasoning, it remains unclear which factors most significantly impede their performance across different tasks and modalities. In this paper, we propose a decompositional strategy that breaks down the counterfactual generation from causality construction to the reasoning over counterfactual interventions. To support decompositional analysis, we investigate 11 datasets spanning diverse tasks, including natural language understanding, mathematics, programming, and vision-language tasks. Through extensive evaluations, we characterize LLM behavior across each decompositional stage and identify how modality type and intermediate reasoning influence performance. By establishing a structured framework for analyzing counterfactual reasoning, this work contributes to the development of more reliable LLM-based reasoning systems and informs future elicitation strategies.

CLFeb 15, 2025
Evolving Hate Speech Online: An Adaptive Framework for Detection and Mitigation

Shiza Ali, Jeremy Blackburn, Gianluca Stringhini

The proliferation of social media platforms has led to an increase in the spread of hate speech, particularly targeting vulnerable communities. Unfortunately, existing methods for automatically identifying and blocking toxic language rely on pre-constructed lexicons, making them reactive rather than adaptive. As such, these approaches become less effective over time, especially when new communities are targeted with slurs not included in the original datasets. To address this issue, we present an adaptive approach that uses word embeddings to update lexicons and develop a hybrid model that adjusts to emerging slurs and new linguistic patterns. This approach can effectively detect toxic language, including intentional spelling mistakes employed by aggressors to avoid detection. Our hybrid model, which combines BERT with lexicon-based techniques, achieves an accuracy of 95% for most state-of-the-art datasets. Our work has significant implications for creating safer online environments by improving the detection of toxic content and proactively updating the lexicon. Content Warning: This paper contains examples of hate speech that may be triggering.

ROMar 13
From Woofs to Words: Towards Intelligent Robotic Guide Dogs with Verbal Communication

Yohei Hayamizu, David DeFazio, Hrudayangam Mehta et al.

Assistive robotics is an important subarea of robotics that focuses on the well-being of people with disabilities. A robotic guide dog is an assistive quadruped robot that helps visually impaired people in obstacle avoidance and navigation. Enabling language capabilities for robotic guide dogs goes beyond naively adding an existing dialog system onto a mobile robot. The novel challenges include grounding language in the dynamically changing environment and improving spatial awareness for the human handler. To address those challenges, we develop a novel dialog system for robotic guide dogs that uses LLMs to verbalize both navigational plans and scenes. The goal is to enable verbal communication for collaborative decision-making within the handler-robot team. In experiments, we conducted a human study to evaluate different verbalization strategies and a simulation study to assess the efficiency and accuracy in navigation tasks.

CYFeb 17, 2022
Feels Bad Man: Dissecting Automated Hateful Meme Detection Through the Lens of Facebook's Challenge

Catherine Jennifer, Fatemeh Tahmasbi, Jeremy Blackburn et al.

Internet memes have become a dominant method of communication; at the same time, however, they are also increasingly being used to advocate extremism and foster derogatory beliefs. Nonetheless, we do not have a firm understanding as to which perceptual aspects of memes cause this phenomenon. In this work, we assess the efficacy of current state-of-the-art multimodal machine learning models toward hateful meme detection, and in particular with respect to their generalizability across platforms. We use two benchmark datasets comprising 12,140 and 10,567 images from 4chan's "Politically Incorrect" board (/pol/) and Facebook's Hateful Memes Challenge dataset to train the competition's top-ranking machine learning models for the discovery of the most prominent features that distinguish viral hateful memes from benign ones. We conduct three experiments to determine the importance of multimodality on classification performance, the influential capacity of fringe Web communities on mainstream social platforms and vice versa, and the models' learning transferability on 4chan memes. Our experiments show that memes' image characteristics provide a greater wealth of information than its textual content. We also find that current systems developed for online detection of hate speech in memes necessitate further concentration on its visual elements to improve their interpretation of underlying cultural connotations, implying that multimodal models fail to adequately grasp the intricacies of hate speech in memes and generalize across social media platforms.

CRDec 1, 2021
TROLLMAGNIFIER: Detecting State-Sponsored Troll Accounts on Reddit

Mohammad Hammas Saeed, Shiza Ali, Jeremy Blackburn et al.

Growing evidence points to recurring influence campaigns on social media, often sponsored by state actors aiming to manipulate public opinion on sensitive political topics. Typically, campaigns are performed through instrumented accounts, known as troll accounts; despite their prominence, however, little work has been done to detect these accounts in the wild. In this paper, we present TROLLMAGNIFIER, a detection system for troll accounts. Our key observation, based on analysis of known Russian-sponsored troll accounts identified by Reddit, is that they show loose coordination, often interacting with each other to further specific narratives. Therefore, troll accounts controlled by the same actor often show similarities that can be leveraged for detection. TROLLMAGNIFIER learns the typical behavior of known troll accounts and identifies more that behave similarly. We train TROLLMAGNIFIER on a set of 335 known troll accounts and run it on a large dataset of Reddit accounts. Our system identifies 1,248 potential troll accounts; we then provide a multi-faceted analysis to corroborate the correctness of our classification. In particular, 66% of the detected accounts show signs of being instrumented by malicious actors (e.g., they were created on the same exact day as a known troll, they have since been suspended by Reddit, etc.). They also discuss similar topics as the known troll accounts and exhibit temporal synchronization in their activity. Overall, we show that using TROLLMAGNIFIER, one can grow the initial knowledge of potential trolls provided by Reddit by over 300%.

CYNov 3, 2021
Slapping Cats, Bopping Heads, and Oreo Shakes: Understanding Indicators of Virality in TikTok Short Videos

Chen Ling, Jeremy Blackburn, Emiliano De Cristofaro et al.

Short videos have become one of the leading media used by younger generations to express themselves online and thus a driving force in shaping online culture. In this context, TikTok has emerged as a platform where viral videos are often posted first. In this paper, we study what elements of short videos posted on TikTok contribute to their virality. We apply a mixed-method approach to develop a codebook and identify important virality features. We do so vis-à-vis three research hypotheses; namely, that: 1) the video content, 2) TikTok's recommendation algorithm, and 3) the popularity of the video creator contribute to virality. We collect and label a dataset of 400 TikTok videos and train classifiers to help us identify the features that influence virality the most. While the number of followers is the most powerful predictor, close-up and medium-shot scales also play an essential role. So does the lifespan of the video, the presence of text, and the point of view. Our research highlights the characteristics that distinguish viral from non-viral TikTok videos, laying the groundwork for developing additional approaches to create more engaging online content and proactively identify possibly risky content that is likely to reach a large audience.

HCJan 16, 2021
Dissecting the Meme Magic: Understanding Indicators of Virality in Image Memes

Chen Ling, Ihab AbuHilal, Jeremy Blackburn et al.

Despite the increasingly important role played by image memes, we do not yet have a solid understanding of the elements that might make a meme go viral on social media. In this paper, we investigate what visual elements distinguish image memes that are highly viral on social media from those that do not get re-shared, across three dimensions: composition, subjects, and target audience. Drawing from research in art theory, psychology, marketing, and neuroscience, we develop a codebook to characterize image memes, and use it to annotate a set of 100 image memes collected from 4chan's Politically Incorrect Board (/pol/). On the one hand, we find that highly viral memes are more likely to use a close-up scale, contain characters, and include positive or negative emotions. On the other hand, image memes that do not present a clear subject the viewer can focus attention on, or that include long text are not likely to be re-shared by users. We train machine learning models to distinguish between image memes that are likely to go viral and those that are unlikely to be re-shared, obtaining an AUC of 0.866 on our dataset. We also show that the indicators of virality identified by our model can help characterize the most viral memes posted on mainstream online social networks too, as our classifiers are able to predict 19 out of the 20 most popular image memes posted on Twitter and Reddit between 2016 and 2018. Overall, our analysis sheds light on what indicators characterize viral and non-viral visual content online, and set the basis for developing better techniques to create or moderate content that is more likely to catch the viewer's attention.

SIJul 20, 2019
Detecting Cyberbullying and Cyberaggression in Social Media

Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn et al.

Cyberbullying and cyberaggression are increasingly worrisome phenomena affecting people across all demographics. More than half of young social media users worldwide have been exposed to such prolonged and/or coordinated digital harassment. Victims can experience a wide range of emotions, with negative consequences such as embarrassment, depression, isolation from other community members, which embed the risk to lead to even more critical consequences, such as suicide attempts. In this work, we take the first concrete steps to understand the characteristics of abusive behavior in Twitter, one of today's largest social media platforms. We analyze 1.2 million users and 2.1 million tweets, comparing users participating in discussions around seemingly normal topics like the NBA, to those more likely to be hate-related, such as the Gamergate controversy, or the gender pay inequality at the BBC station. We also explore specific manifestations of abusive behavior, i.e., cyberbullying and cyberaggression, in one of the hate-related communities (Gamergate). We present a robust methodology to distinguish bullies and aggressors from normal Twitter users by considering text, user, and network-based attributes. Using various state-of-the-art machine learning algorithms, we classify these accounts with over 90% accuracy and AUC. Finally, we discuss the current status of Twitter user accounts marked as abusive by our methodology, and study the performance of potential mechanisms that can be used by Twitter to suspend users in the future.

NIFeb 7, 2019
EYEORG: A Platform For Crowdsourcing Web Quality Of Experience Measurements

Matteo Varvello, Jeremy Blackburn, David Naylor et al.

Tremendous effort has gone into the ongoing battle to make webpages load faster. This effort has culminated in new protocols (QUIC, SPDY, and HTTP/2) as well as novel content delivery mechanisms. In addition, companies like Google and SpeedCurve investigated how to measure "page load time" (PLT) in a way that captures human perception. In this paper we present Eyeorg, a platform for crowdsourcing web quality of experience measurements. Eyeorg overcomes the scaling and automation challenges of recruiting users and collecting consistent user-perceived quality measurements. We validate Eyeorg's capabilities via a set of 100 trusted participants. Next, we showcase its functionalities via three measurement campaigns, each involving 1,000 paid participants, to 1) study the quality of several PLT metrics, 2) compare HTTP/1.1 and HTTP/2 performance, and 3) assess the impact of online advertisements and ad blockers on user experience. We find that commonly used, and even novel and sophisticated PLT metrics fail to represent actual human perception of PLT, that the performance gains from HTTP/2 are imperceivable in some circumstances, and that not all ad blockers are created equal.

CYMay 21, 2018
"You Know What to Do": Proactive Detection of YouTube Videos Targeted by Coordinated Hate Attacks

Enrico Mariconti, Guillermo Suarez-Tangil, Jeremy Blackburn et al.

Video sharing platforms like YouTube are increasingly targeted by aggression and hate attacks. Prior work has shown how these attacks often take place as a result of "raids," i.e., organized efforts by ad-hoc mobs coordinating from third-party communities. Despite the increasing relevance of this phenomenon, however, online services often lack effective countermeasures to mitigate it. Unlike well-studied problems like spam and phishing, coordinated aggressive behavior both targets and is perpetrated by humans, making defense mechanisms that look for automated activity unsuitable. Therefore, the de-facto solution is to reactively rely on user reports and human moderation. In this paper, we propose an automated solution to identify YouTube videos that are likely to be targeted by coordinated harassers from fringe communities like 4chan. First, we characterize and model YouTube videos along several axes (metadata, audio transcripts, thumbnails) based on a ground truth dataset of videos that were targeted by raids. Then, we use an ensemble of classifiers to determine the likelihood that a video will be raided with very good results (AUC up to 94%). Overall, our work provides an important first step towards deploying proactive systems to detect and mitigate coordinated hate attacks on platforms like YouTube.

CRMar 9, 2018
A Family of Droids -- Android Malware Detection via Behavioral Modeling: Static vs Dynamic Analysis

Lucky Onwuzurike, Mario Almeida, Enrico Mariconti et al.

Following the increasing popularity of mobile ecosystems, cybercriminals have increasingly targeted them, designing and distributing malicious apps that steal information or cause harm to the device's owner. Aiming to counter them, detection techniques based on either static or dynamic analysis that model Android malware, have been proposed. While the pros and cons of these analysis techniques are known, they are usually compared in the context of their limitations e.g., static analysis is not able to capture runtime behaviors, full code coverage is usually not achieved during dynamic analysis, etc. Whereas, in this paper, we analyze the performance of static and dynamic analysis methods in the detection of Android malware and attempt to compare them in terms of their detection performance, using the same modeling approach. To this end, we build on MaMaDroid, a state-of-the-art detection system that relies on static analysis to create a behavioral model from the sequences of abstracted API calls. Then, aiming to apply the same technique in a dynamic analysis setting, we modify CHIMP, a platform recently proposed to crowdsource human inputs for app testing, in order to extract API calls' sequences from the traces produced while executing the app on a CHIMP virtual device. We call this system AuntieDroid and instantiate it by using both automated (Monkey) and user-generated inputs. We find that combining both static and dynamic analysis yields the best performance, with F-measure reaching 0.92. We also show that static analysis is at least as effective as dynamic analysis, depending on how apps are stimulated during execution, and, finally, investigate the reasons for inconsistent misclassifications across methods.

CLFeb 1, 2018
A Unified Deep Learning Architecture for Abuse Detection

Antigoni-Maria Founta, Despoina Chatzakou, Nicolas Kourtellis et al.

Hate speech, offensive language, sexism, racism and other types of abusive behavior have become a common phenomenon in many online social media platforms. In recent years, such diverse abusive behaviors have been manifesting with increased frequency and levels of intensity. This is due to the openness and willingness of popular media platforms, such as Twitter and Facebook, to host content of sensitive or controversial topics. However, these platforms have not adequately addressed the problem of online abusive behavior, and their responsiveness to the effective detection and blocking of such inappropriate behavior remains limited. In the present paper, we study this complex problem by following a more holistic approach, which considers the various aspects of abusive behavior. To make the approach tangible, we focus on Twitter data and analyze user and textual properties from different angles of abusive posting behavior. We propose a deep learning architecture, which utilizes a wide variety of available metadata, and combines it with automatically-extracted hidden patterns within the text of the tweets, to detect multiple abusive behavioral norms which are highly inter-related. We apply this unified architecture in a seamless, transparent fashion to detect different types of abusive behavior (hate speech, sexism vs. racism, bullying, sarcasm, etc.) without the need for any tuning of the model architecture for each task. We test the proposed approach with multiple datasets addressing different and multiple abusive behaviors on Twitter. Our results demonstrate that it largely outperforms the state-of-art methods (between 21 and 45\% improvement in AUC, depending on the dataset).

SIMay 9, 2017
Hate is not Binary: Studying Abusive Behavior of #GamerGate on Twitter

Despoina Chatzakou, Nicolas Kourtellis, Jeremy Blackburn et al.

Over the past few years, online bullying and aggression have become increasingly prominent, and manifested in many different forms on social media. However, there is little work analyzing the characteristics of abusive users and what distinguishes them from typical social media users. In this paper, we start addressing this gap by analyzing tweets containing a great large amount of abusiveness. We focus on a Twitter dataset revolving around the Gamergate controversy, which led to many incidents of cyberbullying and cyberaggression on various gaming and social media platforms. We study the properties of the users tweeting about Gamergate, the content they post, and the differences in their behavior compared to typical Twitter users. We find that while their tweets are often seemingly about aggressive and hateful subjects, "Gamergaters" do not exhibit common expressions of online anger, and in fact primarily differ from typical users in that their tweets are less joyful. They are also more engaged than typical Twitter users, which is an indication as to how and why this controversy is still ongoing. Surprisingly, we find that Gamergaters are less likely to be suspended by Twitter, thus we analyze their properties to identify differences from typical users and what may have led to their suspension. We perform an unsupervised machine learning analysis to detect clusters of users who, though currently active, could be considered for suspension since they exhibit similar behaviors with suspended users. Finally, we confirm the usefulness of our analyzed features by emulating the Twitter suspension mechanism with a supervised learning method, achieving very good precision and recall.

CYMar 4, 2017
I Would Not Plant Apple Trees If the World Will Be Wiped: Analyzing Hundreds of Millions of Behavioral Records of Players During an MMORPG Beta Test

Ah Reum Kang, Jeremy Blackburn, Haewoon Kwak et al.

In this work, we use player behavior during the closed beta test of the MMORPG ArcheAge as a proxy for an extreme situation: at the end of the closed beta test, all user data is deleted, and thus, the outcome (or penalty) of players' in-game behaviors in the last few days loses its meaning. We analyzed 270 million records of player behavior in the 4th closed beta test of ArcheAge. Our findings show that there are no apparent pandemic behavior changes, but some outliers were more likely to exhibit anti-social behavior (e.g., player killing). We also found that contrary to the reassuring adage that "Even if I knew the world would go to pieces tomorrow, I would still plant my apple tree," players abandoned character progression, showing a drastic decrease in quest completion, leveling, and ability changes at the end of the beta test.

SIFeb 24, 2017
Measuring #GamerGate: A Tale of Hate, Sexism, and Bullying

Despoina Chatzakou, Nicolas Kourtellis, Jeremy Blackburn et al.

Over the past few years, online aggression and abusive behaviors have occurred in many different forms and on a variety of platforms. In extreme cases, these incidents have evolved into hate, discrimination, and bullying, and even materialized into real-world threats and attacks against individuals or groups. In this paper, we study the Gamergate controversy. Started in August 2014 in the online gaming world, it quickly spread across various social networking platforms, ultimately leading to many incidents of cyberbullying and cyberaggression. We focus on Twitter, presenting a measurement study of a dataset of 340k unique users and 1.6M tweets to study the properties of these users, the content they post, and how they differ from random Twitter users. We find that users involved in this "Twitter war" tend to have more friends and followers, are generally more engaged and post tweets with negative sentiment, less joy, and more hate than random users. We also perform preliminary measurements on how the Twitter suspension mechanism deals with such abusive behaviors. While we focus on Gamergate, our methodology to collect and analyze tweets related to aggressive and bullying activities is of independent interest.

SIOct 11, 2016
Kek, Cucks, and God Emperor Trump: A Measurement Study of 4chan's Politically Incorrect Forum and Its Effects on the Web

Gabriel Emile Hine, Jeremiah Onaolapo, Emiliano De Cristofaro et al.

The discussion-board site 4chan has been part of the Internet's dark underbelly since its inception, and recent political events have put it increasingly in the spotlight. In particular, /pol/, the "Politically Incorrect" board, has been a central figure in the outlandish 2016 US election season, as it has often been linked to the alt-right movement and its rhetoric of hate and racism. However, 4chan remains relatively unstudied by the scientific community: little is known about its user base, the content it generates, and how it affects other parts of the Web. In this paper, we start addressing this gap by analyzing /pol/ along several axes, using a dataset of over 8M posts we collected over two and a half months. First, we perform a general characterization, showing that /pol/ users are well distributed around the world and that 4chan's unique features encourage fresh discussions. We also analyze content, finding, for instance, that YouTube links and hate speech are predominant on /pol/. Overall, our analysis not only provides the first measurement study of /pol/, but also insight into online harassment and hate speech trends in social media.

CYApr 9, 2015
Exploring Cyberbullying and Other Toxic Behavior in Team Competition Online Games

Haewoon Kwak, Jeremy Blackburn, Seungyeop Han

In this work we explore cyberbullying and other toxic behavior in team competition online games. Using a dataset of over 10 million player reports on 1.46 million toxic players along with corresponding crowdsourced decisions, we test several hypotheses drawn from theories explaining toxic behavior. Besides providing large-scale, empirical based understanding of toxic behavior, our work can be used as a basis for building systems to detect, prevent, and counter-act toxic behavior.