CL MM · Sep 18, 2025

Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding

Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler

arXiv:2509.15476v1 · 8.33 citations · h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses sarcasm understanding for natural language processing applications. The advance is incremental: it extends existing multimodal evaluation to audio-visual-textual sarcasm.

The paper evaluates large language models (LLMs) and multimodal LLMs (MLLMs) for sarcasm detection on English (MUStARD++) and Chinese (MCSD 1.0) datasets. Audio-based models perform best among unimodal setups, text-audio and audio-vision combinations outperform both unimodal and trimodal models, and MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned results.

Sarcasm detection remains a challenge in natural language understanding, as sarcastic intent often relies on subtle cross-modal cues spanning text, speech, and vision. While prior work has primarily focused on textual or visual-textual sarcasm, comprehensive audio-visual-textual sarcasm understanding remains underexplored. In this paper, we systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on English (MUStARD++) and Chinese (MCSD 1.0) in zero-shot, few-shot, and LoRA fine-tuning settings. In addition to direct classification, we explore models as feature encoders, integrating their representations through a collaborative gating fusion module. Experimental results show that audio-based models achieve the strongest unimodal performance, while text-audio and audio-vision combinations outperform unimodal and trimodal models. Furthermore, MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned performance. Our findings highlight the potential of MLLMs for cross-lingual, audio-visual-textual sarcasm understanding.
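The abstract does not specify how the collaborative gating fusion module is built, so the following is only a minimal sketch of one common form of gated fusion, under stated assumptions: each modality's feature vector is scaled element-wise by a sigmoid gate computed from the other modalities, and the gated vectors are concatenated. The function names (`gated_fusion`, `init_pairwise_weights`), the pairwise weight layout, and the equal feature dimension across modalities are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def init_pairwise_weights(names, dim, seed=0):
    """Hypothetical helper: one small random dim x dim matrix per ordered modality pair."""
    rng = random.Random(seed)
    return {
        (i, j): [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
        for i in names
        for j in names
        if i != j
    }


def gated_fusion(features, weights):
    """Gate each modality by the others, then concatenate.

    features: dict mapping modality name -> list of floats (same length for all).
    weights:  dict mapping (target, source) -> matrix projecting source into target space.
    """
    fused = []
    for i, x_i in features.items():
        # Context for modality i: sum of projections of every other modality.
        context = [0.0] * len(x_i)
        for j, x_j in features.items():
            if j == i:
                continue
            projected = matvec(weights[(i, j)], x_j)
            context = [c + p for c, p in zip(context, projected)]
        # Element-wise sigmoid gate decides how much of each feature passes through.
        gate = [sigmoid(c) for c in context]
        fused.extend(g * v for g, v in zip(gate, x_i))
    return fused
```

In this sketch, `features` would hold per-modality embeddings (for example text, audio, and vision vectors taken from the encoders), and the fused vector would feed a downstream sarcasm classifier. Dropping a modality from `features` gives the bimodal variants that the abstract reports as strongest.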
