LG · Oct 13, 2025

Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection

arXiv:2510.11852v1 · 2 citations · h-index: 18 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work targets AI researchers and developers who need models that understand complex multimodal phenomena such as sarcasm. It is incremental: it benchmarks existing models rather than introducing new methods.

The study evaluated seven open-source vision-language models on multimodal sarcasm detection using zero-, one-, and few-shot prompting across three benchmark datasets, finding that while models achieved moderate success in binary detection, they struggled to generate high-quality explanations without task-specific finetuning.

Recent advances in open-source vision-language models (VLMs) offer new opportunities for understanding complex and subjective multimodal phenomena such as sarcasm. In this work, we evaluate seven state-of-the-art VLMs - BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL - on their ability to detect multimodal sarcasm using zero-, one-, and few-shot prompting. Furthermore, we evaluate the models' capabilities in generating explanations for sarcastic instances. The evaluation covers three benchmark sarcasm datasets (Muse, MMSD2.0, and SarcNet). Our primary objectives are twofold: (1) to quantify each model's performance in detecting sarcastic image-caption pairs, and (2) to assess their ability to generate human-quality explanations that highlight the visual-textual incongruities driving sarcasm. Our results indicate that, while current models achieve moderate success in binary sarcasm detection, they are still not able to generate high-quality explanations without task-specific finetuning.
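As a rough illustration of the zero-, one-, and few-shot setup and the binary detection scoring described above, the following is a minimal sketch and not the authors' code. The prompt wording, the function names `build_prompt` and `binary_scores`, and the choice of accuracy, precision, recall, and F1 as metrics are all assumptions made for illustration. Images are omitted; a real VLM would receive each image alongside its caption.

```python
from typing import Dict, List, Tuple

# Hypothetical instruction text; the paper's actual prompts are not given here.
INSTRUCTION = "Is the following image-caption pair sarcastic? Answer yes or no."


def build_prompt(query_caption: str, shots: List[Tuple[str, bool]]) -> str:
    """Build a k-shot text prompt, where k = len(shots).

    An empty list gives a zero-shot prompt, one example a one-shot prompt,
    and several examples a few-shot prompt.
    """
    blocks = [INSTRUCTION]
    for caption, is_sarcastic in shots:
        answer = "yes" if is_sarcastic else "no"
        blocks.append(f"Caption: {caption}\nAnswer: {answer}")
    # The query comes last and leaves the answer open for the model.
    blocks.append(f"Caption: {query_caption}\nAnswer:")
    return "\n\n".join(blocks)


def binary_scores(gold: List[bool], pred: List[bool]) -> Dict[str, float]:
    """Score binary sarcasm predictions, treating 'sarcastic' as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    correct = sum(1 for g, p in pairs if g == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Passing the same query with zero, one, or several labeled examples yields the three prompting conditions, and `binary_scores` can then be applied to the parsed yes/no answers for each condition.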

