CLJun 11, 2024
Toxic Memes: A Survey of Computational Perspectives on the Detection and Explanation of Meme ToxicitiesDelfina Sol Martinez Pandiani, Erik Tjong Kim Sang, Davide Ceolin
Internet memes, channels for humor, social commentary, and cultural expression, are increasingly used to spread toxic messages. Studies on the computational analyses of toxic memes have significantly grown over the past five years, and the only three surveys on computational toxic meme analysis cover only work published until 2022, leading to inconsistent terminology and unexplored trends. Our work fills this gap by surveying content-based computational perspectives on toxic memes, and reviewing key developments until early 2024. Employing the PRISMA methodology, we systematically extend the previously considered papers, achieving a threefold result. First, we survey 119 new papers, analyzing 158 computational works focused on content-based toxic meme analysis. We identify over 30 datasets used in toxic meme analysis and examine their labeling systems. Second, after observing the existence of unclear definitions of meme toxicity in computational works, we introduce a new taxonomy for categorizing meme toxicity types. We also note an expansion in computational tasks beyond the simple binary classification of memes as toxic or non-toxic, indicating a shift towards achieving a nuanced comprehension of toxicity. Third, we identify three content-based dimensions of meme toxicity under automatic study: target, intent, and conveyance tactics. We develop a framework illustrating the relationships between these dimensions and meme toxicities. The survey analyzes key challenges and recent trends, such as enhanced cross-modal reasoning, integrating expert and cultural knowledge, the demand for automatic toxicity explanations, and handling meme toxicity in low-resource languages. Also, it notes the rising use of Large Language Models (LLMs) and generative AI for detecting and generating toxic memes. Finally, it proposes pathways for advancing toxic meme detection and interpretation.
CLDec 19, 2023
REE-HDSC: Recognizing Extracted Entities for the Historical Database Suriname CuracaoErik Tjong Kim Sang
We describe the project REE-HDSC and outline our efforts to improve the quality of named entities extracted automatically from texts generated by hand-written text recognition (HTR) software. We describe a six-step processing pipeline and test it by processing 19th and 20th century death certificates from the civil registry of Curacao. We find that the pipeline extracts dates with high precision but that the precision of person name extraction is low. Next we show how name precision extraction can be improved by retraining HTR models with names, post-processing and by identifying and removing incorrect names.
SIJun 12, 2020
Dutch General Public Reaction on Governmental COVID-19 Measures and Announcements in Twitter DataShihan Wang, Marijn Schraagen, Erik Tjong Kim Sang et al.
Public sentiment (the opinions, attitudes or feelings expressed by the public) is a factor of interest for government, as it directly influences the implementation of policies. Given the unprecedented nature of the COVID-19 crisis, having an up-to-date representation of public sentiment on governmental measures and announcements is crucial. While the 'staying-at-home' policy makes face-to-face interactions and interviews challenging, analysing real-time Twitter data that reflects public opinion toward policy measures is a cost-effective way to access public sentiment. In this context, we collect streaming data using the Twitter API starting from the COVID-19 outbreak in the Netherlands in February 2020, and track Dutch general public reactions on governmental measures and announcements. We provide temporal analysis of tweet frequency and public sentiment over the past seven months. We also identify public attitudes towards two Dutch policies in case studies: one regarding social distancing and one regarding wearing face masks. By presenting those preliminary results, we aim to provide visibility into the social media discussions around COVID-19 to the general public, scientists and policy makers. The data collection and analysis will be updated and expanded over time.
CLOct 1, 2018
Utilizing a Transparency-driven Environment toward Trusted Automatic Genre Classification: A Case Study in Journalism HistoryAysenur Bilgin, Laura Hollink, Jacco van Ossenbruggen et al.
With the growing abundance of unlabeled data in real-world tasks, researchers have to rely on the predictions given by black-boxed computational models. However, it is an often neglected fact that these models may be scoring high on accuracy for the wrong reasons. In this paper, we present a practical impact analysis of enabling model transparency by various presentation forms. For this purpose, we developed an environment that empowers non-computer scientists to become practicing data scientists in their own research field. We demonstrate the gradually increasing understanding of journalism historians through a real-world use case study on automatic genre classification of newspaper articles. This study is a first step towards trusted usage of machine learning pipelines in a responsible way.