CL · AI · Feb 11, 2023

HateProof: Are Hateful Meme Detection Systems really Robust?

arXiv:2302.05703v1 · 15 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This paper addresses the robustness of hateful meme detection systems used in social media moderation. The contribution is incremental, as it builds on existing adversarial training methods.

The paper analyzes the vulnerability of hateful meme detection systems to simple human-crafted adversarial attacks, finding performance drops of up to 10% in macro-F1 score, and proposes an ensemble of contrastive learning and VILLA that partially restores robustness.

The exploitation of social media to spread hate has increased tremendously over the years. Lately, multi-modal hateful content such as memes has gained relatively more traction than uni-modal content. Moreover, the presence of implicit content payloads makes such memes fairly challenging for existing hateful meme detection systems to detect. In this paper, we present a use case study to analyze such systems' vulnerabilities against external adversarial attacks. We find that even very simple perturbations in uni-modal and multi-modal settings, performed by humans with little knowledge about the model, can render existing detection models highly vulnerable. Empirically, we find a noticeable performance drop of as much as 10% in the macro-F1 score for certain attacks. As a remedy, we attempt to boost the model's robustness using contrastive learning as well as an adversarial training-based method, VILLA. Using an ensemble of these two approaches, on two of our high-resolution datasets, we are able to regain the performance to a large extent for certain attacks. We believe that ours is a first step toward addressing this crucial problem in an adversarial setting and that it will inspire more such investigations in the future.
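
The abstract reports robustness as a drop in macro-F1, which is the unweighted mean of the per-class F1 scores. The following is a minimal sketch of how such a drop could be computed. The function name `macro_f1` and the toy label lists are invented for illustration only and are not taken from the paper's data or code.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in labels:
        # Count true positives, false positives and false negatives for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (1 = hateful, 0 = not hateful), not real results.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
pred_clean = [1, 1, 0, 0, 0, 0, 0, 1]     # predictions on unmodified memes
pred_attacked = [1, 0, 0, 0, 0, 0, 0, 1]  # predictions after a perturbation

clean = macro_f1(y_true, pred_clean)
attacked = macro_f1(y_true, pred_attacked)
drop = clean - attacked  # absolute drop in macro-F1 caused by the attack

print(f"clean={clean:.3f} attacked={attacked:.3f} drop={drop:.3f}")
```

Because macro-F1 weights both classes equally, a perturbation that causes the model to miss hateful memes lowers the score even when the non-hateful class dominates the data.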

