LG MM Feb 18, 2022

A Review on Methods and Applications in Multimodal Deep Learning

Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul

arXiv:2202.09195v1 17.3181 citations

Originality Synthesis-oriented

AI Analysis

The paper provides a comprehensive survey for researchers in AI and machine learning, but it is incremental: it reviews existing work without introducing new methods or results.

This paper reviews multimodal deep learning methods and applications from 2017 to 2021, analyzing baseline approaches and recent advancements across modalities like image, text, and audio, and proposes a taxonomy while highlighting domain-specific issues and future directions.

Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information across multiple modalities. Despite the extensive progress made in unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning supports better understanding and analysis when various senses are engaged in processing information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. A detailed analysis of the baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications during the last five years (2017 to 2021) are provided. A fine-grained taxonomy of multimodal deep learning methods is proposed, elaborating on different applications in more depth. Lastly, the main issues are highlighted separately for each domain, along with possible future research directions.
