CL · LG · Mar 5, 2024

JMI at SemEval 2024 Task 3: Two-step approach for multimodal ECAC using in-context learning with GPT and instruction-tuned Llama models

arXiv:2403.04798v2 · 27 citations · h-index: 10 · SemEval
AI Analysis

This work addresses the challenge of integrating text, audio, and video for emotion cause analysis in conversations, but it is incremental as it builds on existing models and methods.

The paper tackled multimodal emotion cause analysis in conversations by proposing a two-step framework using instruction-tuned Llama models and GPT-4V with in-context learning, achieving rank 4 in the SemEval-2024 competition and showing significant performance gains in ablation experiments.

This paper presents our system for SemEval-2024 Task 3: "The Competition of Multimodal Emotion Cause Analysis in Conversations". Effectively capturing emotions in human conversations requires integrating multiple modalities such as text, audio, and video. However, the complexities of these diverse modalities pose challenges for developing an efficient multimodal emotion cause analysis (ECA) system. Our proposed approach addresses these challenges with a two-step framework, which we implement in two different ways. In Approach 1, we employ instruction-tuning with two separate Llama 2 models for emotion and cause prediction. In Approach 2, we use GPT-4V for conversation-level video description and employ in-context learning with annotated conversations using GPT-3.5. Our system ranks 4th, and system ablation experiments demonstrate that our proposed solutions achieve significant performance gains. All the experimental code is available on GitHub.
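The two-step idea described in the abstract (first predict an emotion for each utterance, then predict which utterances caused it) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `Utterance` type, the helper names, and the `llm` callable (a stand-in for any instruction-tuned or in-context-prompted model) are all assumptions made for illustration, and `toy_llm` is a hard-coded stub so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Utterance:
    index: int    # 1-based position in the conversation
    speaker: str
    text: str


# Illustrative instructions only; the paper's actual prompts are not reproduced here.
EMOTION_INSTRUCTION = (
    "Label the emotion of the target utterance with one word. "
    "Answer 'neutral' if no emotion is expressed."
)
CAUSE_INSTRUCTION = (
    "List the numbers of the utterances that cause the emotion "
    "of the target utterance, separated by commas."
)


def format_conversation(utterances: Sequence[Utterance]) -> str:
    """Render a conversation as numbered 'speaker: text' lines."""
    return "\n".join(f"{u.index}. {u.speaker}: {u.text}" for u in utterances)


def build_prompt(instruction: str,
                 demonstrations: Sequence[Tuple[str, str]],
                 query: str) -> str:
    """Assemble an in-context prompt: instruction, worked examples, then the query."""
    parts = [instruction]
    for demo_input, demo_output in demonstrations:
        parts.append(f"Input:\n{demo_input}\nOutput: {demo_output}")
    parts.append(f"Input:\n{query}\nOutput:")
    return "\n\n".join(parts)


def two_step_ecac(utterances: Sequence[Utterance],
                  llm: Callable[[str], str],
                  emotion_demos: Sequence[Tuple[str, str]] = (),
                  cause_demos: Sequence[Tuple[str, str]] = ()
                  ) -> List[Tuple[int, str, int]]:
    """Return (emotion_utterance, emotion, cause_utterance) triples."""
    conversation = format_conversation(utterances)
    pairs: List[Tuple[int, str, int]] = []
    for u in utterances:
        # Step 1: emotion prediction for the target utterance.
        emotion_query = f"{conversation}\nTarget utterance: {u.index}"
        emotion = llm(
            build_prompt(EMOTION_INSTRUCTION, emotion_demos, emotion_query)
        ).strip().lower()
        if emotion == "neutral":
            continue  # no emotion, so there is no cause to look for
        # Step 2: cause prediction, conditioned on the predicted emotion.
        cause_query = f"{emotion_query}\nEmotion: {emotion}"
        reply = llm(build_prompt(CAUSE_INSTRUCTION, cause_demos, cause_query))
        for token in reply.replace(",", " ").split():
            if token.isdigit():
                pairs.append((u.index, emotion, int(token)))
    return pairs


def toy_llm(prompt: str) -> str:
    """Hard-coded stand-in for a real model, for demonstration only."""
    query = prompt.rsplit("Input:\n", 1)[-1]
    if "Emotion:" in query:
        return "1"  # cause step: always point at utterance 1
    return "joy" if "Target utterance: 2" in query else "neutral"


if __name__ == "__main__":
    conversation = [
        Utterance(1, "A", "I got the job!"),
        Utterance(2, "B", "That is wonderful news!"),
    ]
    print(two_step_ecac(conversation, toy_llm))
```

In a real system, `llm` would wrap a call to a fine-tuned or hosted model, and the demonstration lists would hold annotated conversations. For the multimodal variant, a generated video description could be appended to the conversation text before prompting, though how the paper does this exactly is not specified in the abstract.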
