Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding
This work addresses sarcasm understanding, a persistent problem for natural language processing applications. Its contribution is incremental: it extends existing multimodal evaluation to audio-visual-textual sarcasm.
The paper tackles sarcasm detection by evaluating large language models (LLMs) and multimodal LLMs (MLLMs) on English and Chinese datasets. It finds that audio-based models perform best among unimodal setups, that text-audio and audio-vision combinations outperform both unimodal and trimodal models, and that MLLMs such as Qwen-Omni achieve competitive results.
Sarcasm detection remains a challenge in natural language understanding, as sarcastic intent often relies on subtle cross-modal cues spanning text, speech, and vision. While prior work has primarily focused on textual or visual-textual sarcasm, comprehensive audio-visual-textual sarcasm understanding remains underexplored. In this paper, we systematically evaluate large language models (LLMs) and multimodal LLMs (MLLMs) for sarcasm detection on English (MUStARD++) and Chinese (MCSD 1.0) datasets in zero-shot, few-shot, and LoRA fine-tuning settings. In addition to direct classification, we explore models as feature encoders, integrating their representations through a collaborative gating fusion module. Experimental results show that audio-based models achieve the strongest unimodal performance, while text-audio and audio-vision combinations outperform unimodal and trimodal models. Furthermore, MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned performance. Our findings highlight the potential of MLLMs for cross-lingual, audio-visual-textual sarcasm understanding.
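To make the idea of fusing encoder representations through a gate more concrete, the following is a minimal sketch of a generic two-modality gated fusion layer. It is not the paper's collaborative gating module, whose exact design is not given here. The class name GatedFusion, the embedding size, the random weights, and the toy inputs are all assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


class GatedFusion:
    """Hypothetical gated fusion of two modality embeddings (illustrative only).

    A gate in (0, 1) is computed from the concatenated embeddings and used
    to form a per-dimension convex combination of the two inputs.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed parameters: in practice these would be learned, not random.
        self.W = rng.normal(scale=0.1, size=(2 * dim, dim))
        self.b = np.zeros(dim)

    def __call__(self, a, b):
        # a, b: arrays of shape (batch, dim), e.g. audio and text features.
        gate = sigmoid(np.concatenate([a, b], axis=-1) @ self.W + self.b)
        return gate * a + (1.0 - gate) * b


# Toy inputs standing in for encoder outputs (assumed shapes).
audio = np.ones((4, 8))
text = np.zeros((4, 8))
fused = GatedFusion(dim=8)(audio, text)
print(fused.shape)
```

Because the gate lies strictly between 0 and 1, each fused value falls between the corresponding audio and text values, so the layer can lean on whichever modality is more informative for a given dimension.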