CV · Apr 27, 2023 · Code
VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias
Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos et al.
Multimedia content has become ubiquitous on social media platforms, leading to the rise of multimodal misinformation (MM) and the urgent need for effective strategies to detect and prevent its spread. In recent years, the challenge of multimodal misinformation detection (MMD) has garnered significant attention from researchers and has mainly involved the creation of annotated, weakly annotated, or synthetically generated training datasets, along with the development of various deep learning MMD models. However, the problem of unimodal bias has been overlooked: specific patterns and biases in MMD benchmarks can result in biased or unimodal models outperforming their multimodal counterparts on an inherently multimodal task, making it difficult to assess progress. In this study, we systematically investigate and identify the presence of unimodal bias in widely-used MMD benchmarks, namely VMU-Twitter and COSMOS. To address this issue, we introduce the "VERification of Image-TExt pairs" (VERITE) benchmark for MMD, which incorporates real-world data, excludes "asymmetric multimodal misinformation" and utilizes "modality balancing". We conduct an extensive comparative study with a Transformer-based architecture that shows the ability of VERITE to effectively address unimodal bias, rendering it a robust evaluation framework for MMD. Furthermore, we introduce a new method, termed Crossmodal HArd Synthetic MisAlignment (CHASMA), for generating realistic synthetic training data that preserve crossmodal relations between legitimate images and false human-written captions. By leveraging CHASMA in the training process, we observe consistent and notable improvements in predictive performance on VERITE, with a 9.2% increase in accuracy. We release our code at: https://github.com/stevejpapad/image-text-verification
CV · Jul 27, 2022 · Code
VICTOR: Visual Incompatibility Detection with Transformers and Fashion-specific contrastive pre-training
Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos et al.
For fashion outfits to be considered aesthetically pleasing, the garments that constitute them need to be compatible in terms of visual aspects, such as style, category and color. Previous works have defined visual compatibility as a binary classification task, with the items in an outfit being considered either fully compatible or fully incompatible. However, this is not applicable to Outfit Maker applications where users create their own outfits and need to know which specific items may be incompatible with the rest of the outfit. To address this, we propose the Visual InCompatibility TransfORmer (VICTOR), which is optimized for two tasks: 1) overall compatibility as regression and 2) the detection of mismatching items. We also utilize fashion-specific contrastive language-image pre-training for fine-tuning computer vision neural networks on fashion imagery. We build upon the Polyvore outfit benchmark to generate partially mismatching outfits, creating a new dataset termed Polyvore-MISFITs, which is used to train VICTOR. A series of ablation and comparative analyses show that the proposed architecture can compete with and even surpass the current state-of-the-art on Polyvore datasets while reducing the instance-wise floating-point operations by 88%, striking a balance between high performance and efficiency. We release our code at https://github.com/stevejpapad/Visual-InCompatibility-Transformer
MM · Nov 16, 2023 · Code
RED-DOT: Multimodal Fact-checking via Relevant Evidence Detection
Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos et al.
Online misinformation is often multimodal in nature, i.e., it is caused by misleading associations between texts and accompanying images. To support the fact-checking process, researchers have recently been developing automatic multimodal methods that gather and analyze external information (evidence) related to the image-text pairs under examination. However, prior works assumed all external information collected from the web to be relevant. In this study, we introduce a "Relevant Evidence Detection" (RED) module to discern whether each piece of evidence is relevant for supporting or refuting the claim. Specifically, we develop the "Relevant Evidence Detection Directed Transformer" (RED-DOT) and explore multiple architectural variants (e.g., single or dual-stage) and mechanisms (e.g., "guided attention"). Extensive ablation and comparative experiments demonstrate that RED-DOT achieves significant improvements over the state-of-the-art (SotA) on the VERITE benchmark by up to 33.7%. Furthermore, our evidence re-ranking and element-wise modality fusion led to RED-DOT surpassing the SotA on NewsCLIPpings+ by up to 3% without the need for numerous pieces of evidence or multiple backbone encoders. We release our code at: https://github.com/stevejpapad/relevant-evidence-detection
CV · Jul 18, 2024 · Code
Similarity over Factuality: Are we making progress on multimodal out-of-context misinformation detection?
Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos et al.
Out-of-context (OOC) misinformation poses a significant challenge in multimodal fact-checking, where images are paired with texts that misrepresent their original context to support false narratives. Recent research in evidence-based OOC detection has seen a trend towards increasingly complex architectures, incorporating Transformers, foundation models, and large language models. In this study, we introduce a simple yet robust baseline, which assesses MUltimodal SimilaritiEs (MUSE), specifically the similarity between image-text pairs and external image and text evidence. Our results demonstrate that MUSE, when used with conventional classifiers like Decision Tree, Random Forest, and Multilayer Perceptron, can compete with and even surpass the state-of-the-art on the NewsCLIPpings and VERITE datasets. Furthermore, integrating MUSE in our proposed "Attentive Intermediate Transformer Representations" (AITR) significantly improved performance, by 3.3% and 7.5% on NewsCLIPpings and VERITE, respectively. Nevertheless, the success of MUSE, relying on surface-level patterns and shortcuts, without examining factuality and logical inconsistencies, raises critical questions about how we define the task, construct datasets, collect external evidence and overall, how we assess progress in the field. We release our code at: https://github.com/stevejpapad/outcontext-misinfo-progress
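As a rough illustration of the similarity-feature idea described in this abstract, the following minimal sketch builds a small feature vector from cosine similarities between an image-text pair and its external evidence. It is not the authors' implementation: the function names (`cosine`, `muse_features`), the exact set of five similarities, and the toy embeddings are all assumptions made for illustration. In practice the embeddings would come from a pretrained encoder, and the resulting features would be passed to a conventional classifier such as a Random Forest.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def muse_features(image, text, image_evidence, text_evidence):
    """Hypothetical similarity feature vector for one image-text pair.

    All four arguments are embedding vectors of the same dimension.
    The choice of these five similarities is an assumption for illustration.
    """
    return [
        cosine(image, text),            # pair under examination
        cosine(image, image_evidence),  # image vs. retrieved image evidence
        cosine(text, text_evidence),    # text vs. retrieved text evidence
        cosine(image, text_evidence),   # cross-modal: image vs. text evidence
        cosine(text, image_evidence),   # cross-modal: text vs. image evidence
    ]


# Toy embeddings standing in for real encoder outputs.
image = [1.0, 0.0, 0.0]
text = [0.9, 0.1, 0.0]
image_evidence = [1.0, 0.0, 0.0]
text_evidence = [0.0, 1.0, 0.0]

features = muse_features(image, text, image_evidence, text_evidence)
print(features)
```

A low-dimensional vector like this one is what makes simple classifiers viable for the task, which is also why the abstract cautions that such a model may rely on surface-level similarity patterns rather than on factuality.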
MM · Mar 2, 2023
Synthetic Misinformers: Generating and Combating Multimodal Misinformation
Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos et al.
With the expansion of social media and the increasing dissemination of multimedia content, the spread of misinformation has become a major concern. This necessitates effective strategies for multimodal misinformation detection (MMD) that detect whether the combination of an image and its accompanying text could mislead or misinform. Due to the data-intensive nature of deep neural networks and the labor-intensive process of manual annotation, researchers have been exploring various methods for automatically generating synthetic multimodal misinformation, which we refer to as Synthetic Misinformers, in order to train MMD models. However, limited evaluation on real-world misinformation and a lack of comparisons with other Synthetic Misinformers make it difficult to assess progress in the field. To address this, we perform a comparative study on existing and new Synthetic Misinformers that involves (1) out-of-context (OOC) image-caption pairs, (2) cross-modal named entity inconsistency (NEI) as well as (3) hybrid approaches, and we evaluate them against real-world misinformation using the COSMOS benchmark. The comparative study showed that our proposed CLIP-based Named Entity Swapping can lead to MMD models that surpass other OOC and NEI Misinformers in terms of multimodal accuracy and that hybrid approaches can lead to even higher detection accuracy. Nevertheless, after alleviating information leakage from the COSMOS evaluation protocol, low Sensitivity scores indicate that the task is significantly more challenging than previous studies suggested. Finally, our findings showed that NEI-based Synthetic Misinformers tend to suffer from a unimodal bias, where text-only MMD models can outperform multimodal ones.
CV · Dec 14, 2022
Design-time Fashion Popularity Forecasting in VR Environments
Stefanos-Iordanis Papadopoulos, Christos Koutlis, Anastasios Papazoglou-Chalikias et al.
Being able to forecast the popularity of new garment designs is very important in an industry as fast-paced as fashion, both in terms of profitability and reducing the problem of unsold inventory. Here, we attempt to address this task in order to provide informative forecasts to fashion designers within a virtual reality designer application that will allow them to fine-tune their creations based on current consumer preferences within an interactive and immersive environment. To achieve this, we have to deal with the following central challenges: (1) the proposed method should not hinder the creative process and thus it has to rely only on the garment's visual characteristics, (2) the new garment lacks historical data from which to extrapolate its future popularity and (3) fashion trends in general are highly dynamic. To this end, we develop a computer vision pipeline fine-tuned on fashion imagery in order to extract relevant visual features along with the category and attributes of the garment. We propose a hierarchical label sharing (HLS) pipeline for automatically capturing hierarchical relations among fashion categories and attributes. Moreover, we propose MuQAR, a Multimodal Quasi-AutoRegressive neural network that forecasts the popularity of new garments by combining their visual features and categorical features, while an autoregressive neural network models the popularity time series of the garment's category and attributes. Both the proposed HLS and MuQAR prove capable of surpassing the current state-of-the-art in key benchmark datasets: DeepFashion for image classification and VISUELLE for new garment sales forecasting.
55.2 · CR · Apr 15
The Synthetic Media Shift: Tracking the Rise, Virality, and Detectability of AI-Generated Multimodal Misinformation
Zacharias Chrysidis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos
As generative AI advances, the distinction between authentic and synthetic media is increasingly blurred, challenging the integrity of online information. In this study, we present CONVEX, a large-scale dataset of multimodal misinformation involving miscaptioned, edited, and AI-generated visual content, comprising over 150K multimodal posts with associated notes and engagement metrics from X's Community Notes. We analyze how multimodal misinformation evolves in terms of virality, engagement, and consensus dynamics, with a focus on synthetic media. Our results show that while AI-generated content achieves disproportionate virality, its spread is driven primarily by passive engagement rather than active discourse. Despite slower initial reporting, AI-generated content reaches community consensus more quickly once flagged. Moreover, our evaluation of specialized detectors and vision-language models reveals a consistent decline in performance over time in distinguishing synthetic from authentic images as generative models evolve. These findings highlight the need for continuous monitoring and adaptive strategies in the rapidly evolving digital information environment.
CV · Apr 8, 2025 · Code
Latent Multimodal Reconstruction for Misinformation Detection
Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos et al.
Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. To support fact-checkers, researchers have focused on developing datasets and methods for multimodal misinformation detection (MMD). Due to the scarcity of large-scale annotated MMD datasets, recent approaches rely on synthetic training data created via out-of-context pairings or named entity manipulations (e.g., altering names, dates, or locations). However, these often yield simplistic examples that lack real-world complexity, limiting model robustness. Meanwhile, Large Vision-Language Models (LVLMs) remain underexplored for generating diverse and realistic synthetic data for MMD. To address this, we introduce "MisCaption This!", a collection of LVLM-generated miscaptioned image datasets. Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to guide detection. We explore various training strategies (end-to-end vs. large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" generalize better to real-world misinformation, while LAMAR achieves new state-of-the-art performance on both the NewsCLIPpings and VERITE benchmarks, highlighting the value of LVLM-generated data and reconstruction-based networks for advancing MMD. Our code is available at https://github.com/stevejpapad/miscaptioned-image-reconstruction
CL · Apr 29, 2024
Credible, Unreliable or Leaked?: Evidence Verification for Enhanced Automated Fact-checking
Zacharias Chrysidis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos et al.
Automated fact-checking (AFC) is garnering increasing attention from researchers aiming to help fact-checkers combat the growing spread of misinformation online. While many existing AFC methods incorporate external information from the Web to help examine the veracity of claims, they often overlook the importance of verifying the source and quality of collected "evidence". One overlooked challenge involves the reliance on "leaked evidence", information gathered directly from fact-checking websites and used to train AFC systems, resulting in an unrealistic setting for early misinformation detection. Similarly, the inclusion of information from unreliable sources can undermine the effectiveness of AFC systems. To address these challenges, we present a comprehensive approach to evidence verification and filtering. We create the "CREDible, Unreliable or LEaked" (CREDULE) dataset, which consists of 91,632 articles classified as Credible, Unreliable and Fact checked (Leaked). Additionally, we introduce the EVidence VERification Network (EVVER-Net), trained on CREDULE to detect leaked and unreliable evidence in both short and long texts. EVVER-Net can be used to filter evidence collected from the Web, thus enhancing the robustness of end-to-end AFC systems. We experiment with various language models and show that EVVER-Net achieves up to 91.5% and 94.4% accuracy when leveraging domain credibility scores along with short and long texts, respectively. Finally, we assess the evidence provided by widely-used fact-checking datasets including LIAR-PLUS, MOCHEG, FACTIFY, NewsCLIPpings+ and VERITE, some of which exhibit concerning rates of leaked and unreliable evidence.
CV · Aug 28, 2025
"Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection
Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos et al.
Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.