SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning

Zhu Li, Yongjian Chen, Huiyuan Lai, Xiyuan Gao, Shekhar Nayak, Matt Coler

arXiv:2603.05275v11.2h-index: 13

Originality: Incremental advance

AI Analysis

This work provides an incremental improvement in multimodal sarcasm detection for researchers and practitioners working with foundation models, by enhancing reasoning and reducing hallucination.

This paper addresses multimodal sarcasm detection by proposing SarcasmMiner, a reinforcement learning framework that improves reasoning and resists hallucination in foundation models. It achieves an F1 score of 70.22% on MUStARD++, an increase from 59.83% (zero-shot) and 68.23% (supervised finetuning).

Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement-learning-based post-training framework that resists hallucination in multimodal reasoning. We reformulate sarcasm detection as structured reasoning and adopt a dual-track distillation strategy: high-quality teacher trajectories initialize the student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning quality. The student is then optimized with group relative policy optimization (GRPO) using decoupled rewards for accuracy and reasoning quality. On MUStARD++, SarcasmMiner raises F1 to 70.22%, up from 59.83% (zero-shot) and 68.23% (supervised finetuning). These findings suggest that reasoning-aware reward modeling enhances both performance and multimodal grounding.
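To make the "decoupled rewards" idea concrete, the following is a minimal sketch of how group-relative advantages might be computed when accuracy and reasoning quality are scored separately. It is not the authors' implementation: the abstract does not say how the two rewards are combined, so the normalize-each-track-then-sum rule, the weights `w_acc` and `w_quality`, and the function names are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(acc_rewards, quality_rewards, w_acc=1.0, w_quality=0.5):
    """Normalize each reward track on its own, then combine the results.

    Assumption: the weights and the normalize-then-sum rule are illustrative;
    the source does not specify them.
    """
    if len(acc_rewards) != len(quality_rewards):
        raise ValueError("both reward lists must cover the same group of responses")
    acc_adv = group_normalize(acc_rewards)
    quality_adv = group_normalize(quality_rewards)
    return [w_acc * a + w_quality * q for a, q in zip(acc_adv, quality_adv)]


# Hypothetical group of four sampled responses to one prompt:
# acc is 1.0 when the sarcasm label is correct, quality is a reward-model
# score in [0, 1] for the reasoning trace (both values are made up).
acc = [1.0, 0.0, 1.0, 0.0]
quality = [0.9, 0.2, 0.4, 0.6]
adv = decoupled_advantages(acc, quality)
```

Under these assumptions, two responses with the same correct label still receive different advantages when their reasoning scores differ, which is one way a quality signal could discourage correct answers reached through poorly grounded reasoning.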
