CL CV | Aug 14, 2019

Reactive Multi-Stage Feature Fusion for Multimodal Dialogue Modeling

Yi-Ting Yeh, Tzu-Chuan Lin, Hsiao-Hua Cheng, Yu-Hsuan Deng, Shang-Yu Su, Yun-Nung Chen

arXiv:1908.05067v11.616 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses multimodal dialogue modeling, but appears incremental: it builds on existing methods and applies them to a new task.

The paper tackles the audio visual scene-aware dialogue (AVSD) task by proposing a multi-stage feature fusion mechanism for integrating multimodal features, and its experiments show the mechanism to be effective.

Visual question answering and visual dialogue tasks have been increasingly studied in the multimodal field as a step towards more practical real-world scenarios. A more challenging task, audio visual scene-aware dialogue (AVSD), has been proposed to further advance the technologies that connect audio, vision, and language; it introduces temporal video information and dialogue interactions between a questioner and an answerer. This paper proposes an intuitive mechanism that fuses features and attention in multiple stages to better integrate multimodal features, and the experimental results demonstrate its capability. We also apply several models that are state-of-the-art on other tasks to the AVSD task, and further analyze how well they generalize across tasks.
