LG, May 13, 2022

Multimodal Conversational AI: A Survey of Datasets and Approaches

arXiv:2205.06907v1641 citations, h-index: 7
Originality: Synthesis-oriented
AI Analysis

The paper addresses the problem of building AI systems that understand and express themselves through multiple modalities to hold more human-like conversations. As a survey, its contribution is incremental.

This paper surveys the field of multimodal conversational AI, defining its research objective and providing a taxonomy of key areas such as representation and fusion, while identifying multimodal co-learning as a promising direction for future work.

As humans, we experience the world with all our senses or modalities (sound, sight, touch, smell, and taste). We use these modalities, particularly sight and touch, to convey and interpret specific meanings. Multimodal expressions are central to conversations; a rich set of modalities amplify and often compensate for each other. A multimodal conversational AI system answers questions, fulfills tasks, and emulates human conversations by understanding and expressing itself via multiple modalities. This paper motivates, defines, and mathematically formulates the multimodal conversational research objective. We provide a taxonomy of research required to solve the objective: multimodal representation, fusion, alignment, translation, and co-learning. We survey state-of-the-art datasets and approaches for each research area and highlight their limiting assumptions. Finally, we identify multimodal co-learning as a promising direction for multimodal conversational AI research.

