MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts
The paper addresses the problem of limited interaction modeling in multimodal AI for applications such as social media analysis. The contribution is incremental, since it builds on existing mixture-of-experts frameworks.
The paper tackles the challenge of modeling diverse multimodal interactions beyond image-text correspondence, such as sarcasm and humor, by introducing Multimodal Mixtures of Experts (MMoE), which trains separate experts for different interaction types and achieves new state-of-the-art results on sarcasm detection (MUStARD) and humor detection (URFUNNY) tasks.
Advances in multimodal models have greatly improved how interactions relevant to various tasks are modeled. Today's multimodal models mainly focus on the correspondence between images and text, using this for tasks like image-text matching. However, this covers only a subset of real-world interactions. Novel interactions, such as sarcasm expressed through opposing spoken words and gestures or humor expressed through utterances and tone of voice, remain challenging. In this paper, we introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE). The key idea in MMoE is to train separate expert models for each type of multimodal interaction, such as redundancy present in both modalities, uniqueness in one modality, or synergy that emerges when both modalities are fused. On a sarcasm detection task (MUStARD) and a humor detection task (URFUNNY), we obtain new state-of-the-art results. MMoE can also be applied to various types of models to improve their performance.
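As a rough illustration of the key idea, the following Python sketch sorts training samples into the three interaction types and dispatches an input to the expert for its type. The labeling rule used here (comparing two unimodal predictions with the ground-truth label) and all names (`interaction_type`, `split_by_interaction`, `mixture_predict`) are assumptions made for illustration; the abstract does not state how interaction types are assigned or how experts are combined.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# The three interaction types named in the abstract.
INTERACTIONS = ("redundancy", "uniqueness", "synergy")

# A sample is (prediction from modality A, prediction from modality B, label).
Sample = Tuple[int, int, int]


def interaction_type(pred_a: int, pred_b: int, label: int) -> str:
    """Hypothetical heuristic (an assumption, not taken from the abstract).

    - unimodal predictions disagree        -> "uniqueness"
    - they agree and match the label       -> "redundancy"
    - they agree but both miss the label   -> "synergy"
    """
    if pred_a != pred_b:
        return "uniqueness"
    if pred_a == label:
        return "redundancy"
    return "synergy"


def split_by_interaction(samples: Iterable[Sample]) -> Dict[str, List[Sample]]:
    """Group samples so that each expert can be trained on its own subset."""
    buckets: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        buckets[interaction_type(*sample)].append(sample)
    return dict(buckets)


def mixture_predict(
    experts: Dict[str, Callable[[object], object]],
    kind: str,
    features: object,
) -> object:
    """Dispatch the input to the expert for the given interaction type."""
    return experts[kind](features)


if __name__ == "__main__":
    data = [(1, 1, 1), (1, 0, 1), (0, 0, 1), (0, 1, 0)]
    for kind, subset in split_by_interaction(data).items():
        print(kind, subset)
```

In this sketch the interaction type must be known before `mixture_predict` is called. At inference time, where no label is available, some routing or weighting step would be needed to choose among experts; that step is left out here because the abstract does not describe it.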