FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

Paramananda Bhaskar, Naquee Rizwan, Daksh Jogchand, Saurabh Kumar Pandey, Animesh Mukherjee

arXiv:2605.3134980.6

AI Analysis

This work addresses the problem of evaluating and improving the robustness of vision-language models for hateful meme detection, which is crucial for building more reliable content moderation systems. It highlights a significant generalization gap in current models.

This paper introduces FBHM, a new benchmark for hateful meme detection with 5,000 memes spanning 25 rhetorical functionalities and 10 target communities. State-of-the-art VLMs that score highly on existing datasets drop to near-random performance on FBHM, which indicates a generalization gap. The proposed LSV (learnable steering vectors) strategy uses as few as 500 steering samples and improves FBHM performance by approximately 30 Macro-F1 points.

Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational: they confound rhetorical hate mechanisms with target community features, which prevents causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models that are highly accurate on standard datasets drop catastrophically to near-random performance on FBHM, indicating that they exploit dataset-specific heuristics rather than robust multimodal reasoning. To close this gap efficiently, we propose LSV (learnable steering vectors), a strategy for the ultra-low-data regime that applies a causal intervention objective on as few as 500 steering samples (50 unique base memes). LSV boosts FBHM performance by ~30 Macro-F1 points and outperforms in-context learning and PEFT without degrading source-domain performance.
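The results above are reported in Macro-F1, so a small worked example may help in reading "near-random" and "~30 points". The sketch below is illustrative only: the function name `macro_f1`, the binary label set, and the balanced toy labels are assumptions made for the example, not details taken from the paper or its evaluation code.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores (Macro-F1), in [0, 1]."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical balanced toy set: 50 hateful (1) and 50 non-hateful (0) items.
y_true = [1] * 50 + [0] * 50

# A predictor that always answers "hateful" gets 50% accuracy but low Macro-F1.
always_hateful = [1] * 100

# A predictor that is right half the time in each class, like a coin flip.
coin_flip_like = [1, 0] * 50

print(round(macro_f1(y_true, always_hateful), 3))
print(round(macro_f1(y_true, coin_flip_like), 3))
```

Under these assumptions, a coin-flip-like predictor sits at 0.5 (50 Macro-F1 points), while a constant predictor falls to about 0.333 because the class it never predicts contributes an F1 of zero. This is why Macro-F1 penalizes models that collapse to one label more heavily than accuracy does.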
