CL · Dec 11, 2024

TECO: Improving Multimodal Intent Recognition with Text Enhancement through Commonsense Knowledge Extraction

arXiv:2412.08529v1 · 2 citations · h-index: 5 · PACLIC
Originality: Incremental advance
AI Analysis

This work addresses challenges in multimodal intent recognition for dialogue systems, but it appears incremental as it builds on existing methods with enhancements.

The paper tackles multimodal intent recognition by proposing TECO, a method that enhances text with commonsense knowledge and aligns it with other modalities, and it reports substantial improvements over baseline methods.

The objective of multimodal intent recognition (MIR) is to leverage various modalities, such as text, video, and audio, to detect user intentions, which is crucial for understanding human language and context in dialogue systems. Despite advances in this field, two main challenges persist: (1) effectively extracting and utilizing semantic information from robust textual features; (2) effectively aligning and fusing non-verbal modalities with verbal ones. This paper proposes a Text Enhancement with CommOnsense Knowledge Extractor (TECO) to address these challenges. We begin by extracting relations from both generated and retrieved knowledge to enrich the contextual information in the text modality. Subsequently, we align and integrate visual and acoustic representations with these enhanced text features to form a cohesive multimodal representation. Our experimental results show substantial improvements over existing baseline methods.

