AI · CL · May 2

MemeLens: Multilingual Multitask VLMs for Memes

arXiv:2601.12539 · 94.61 citations · h-index: 48 · Has Code
AI Analysis

This work provides a comprehensive benchmark and model for the fragmented field of meme analysis, enabling cross-domain generalization across tasks and languages.

The authors propose MemeLens, a unified multilingual and multitask VLM for meme understanding, consolidating 38 public datasets into 20 tasks. Their results show that multimodal training is essential, performance varies across semantic categories, and unified training avoids over-specialization.

Memes are a dominant medium for online communication and manipulation because meaning emerges from interactions between embedded text, imagery, and cultural context. Existing meme research is distributed across tasks (hate, misogyny, propaganda, sentiment, humour) and languages, which limits cross-domain generalization. To address this gap, we propose MemeLens, a unified multilingual and multitask explanation-enhanced Vision Language Model (VLM) for meme understanding. We consolidate 38 public meme datasets, then filter and map their dataset-specific labels into a shared taxonomy of 20 tasks spanning harm, targets, figurative/pragmatic intent, and affect. We present a comprehensive empirical analysis across modeling paradigms, task categories, and datasets. Our findings suggest that robust meme understanding requires multimodal training, exhibits substantial variation across semantic categories, and remains sensitive to over-specialization when models are fine-tuned on individual datasets rather than trained in a unified setting. We make the experimental resources (https://github.com/MohamedBayan/MemeLens), model (https://huggingface.co/QCRI/MemeLens-VLM), and datasets (https://huggingface.co/datasets/QCRI/MemeLens) publicly available to the community.
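The consolidation step described in the abstract (filtering dataset-specific labels and mapping them into a shared taxonomy) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the dataset names (`dataset_a`, `dataset_b`, `dataset_c`), the raw labels, the record fields, and the helper names `map_example` and `consolidate` are all hypothetical and do not come from MemeLens.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mapping: (dataset, raw label) -> (shared task, shared label).
# None of these names are taken from MemeLens; they only illustrate the idea
# of projecting heterogeneous label sets onto one taxonomy.
LABEL_MAP: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("dataset_a", "hateful"): ("hate", "positive"),
    ("dataset_a", "not-hateful"): ("hate", "negative"),
    ("dataset_b", "1"): ("hate", "positive"),
    ("dataset_b", "0"): ("hate", "negative"),
    ("dataset_c", "funny"): ("humour", "positive"),
    ("dataset_c", "not_funny"): ("humour", "negative"),
}


def map_example(dataset: str, label: str) -> Optional[Tuple[str, str]]:
    """Return (task, label) in the shared taxonomy, or None if unmapped."""
    return LABEL_MAP.get((dataset, label.strip().lower()))


def consolidate(records: List[dict]) -> List[dict]:
    """Map raw records into the shared taxonomy, dropping unmapped labels."""
    unified = []
    for rec in records:
        mapped = map_example(rec["dataset"], rec["label"])
        if mapped is None:
            continue  # filtered out: no counterpart in the shared taxonomy
        task, label = mapped
        unified.append(
            {"image": rec["image"], "text": rec["text"], "task": task, "label": label}
        )
    return unified


if __name__ == "__main__":
    raw = [
        {"dataset": "dataset_a", "label": "Hateful", "image": "a.png", "text": "x"},
        {"dataset": "dataset_b", "label": "2", "image": "b.png", "text": "y"},
        {"dataset": "dataset_c", "label": "funny", "image": "c.png", "text": "z"},
    ]
    for row in consolidate(raw):
        print(row["task"], row["label"])
```

In this sketch, records whose label has no entry in the mapping are dropped, which is one simple way to read "filter"; the actual filtering criteria used for MemeLens are not specified in the abstract.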
