LG, AI, CL | Aug 31, 2021

Improving Multimodal fusion via Mutual Dependency Maximisation

arXiv:2109.00922v2670 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of effectively integrating visual, acoustic, and linguistic modalities for sentiment analysis, offering incremental improvements over existing methods.

The paper tackles the problem of multimodal fusion in sentiment analysis by proposing new objectives that measure dependency between modalities, resulting in up to 4.3% accuracy improvement on state-of-the-art models across two datasets and producing more robust representations.

Multimodal sentiment analysis is a trending area of research, and multimodal fusion is one of its most active topics. Since humans communicate through a variety of channels (i.e., visual, acoustic, linguistic), multimodal systems aim to integrate different unimodal representations into a synthetic one. So far, considerable effort has gone into developing complex architectures that fuse these modalities. However, such systems are mainly trained by minimising simple losses such as $L_1$ or cross-entropy. In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities. We demonstrate that our new penalties lead to a consistent improvement (up to $4.3$ in accuracy) across a wide variety of state-of-the-art models on two well-known sentiment analysis datasets: \texttt{CMU-MOSI} and \texttt{CMU-MOSEI}. Our method not only achieves a new SOTA on both datasets but also produces representations that are more robust to modality drops. Finally, a by-product of our method is a statistical network that can be used to interpret the high-dimensional representations learnt by the model.
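To make the idea of a dependency penalty concrete, here is a minimal sketch of how a task loss could be combined with an estimate of the dependency between two modalities. The abstract does not specify the paper's objectives, so this sketch substitutes the well-known Donsker-Varadhan lower bound on mutual information as an assumed stand-in. The names `dv_lower_bound`, `penalised_loss`, `statistic`, and `weight` are hypothetical, and the statistic here is a fixed function rather than a trained network.

```python
import math


def dv_lower_bound(xs, ys, statistic):
    """Donsker-Varadhan estimate: E_joint[T] - log E_marginals[exp(T)].

    xs, ys    : paired scalar features from two modalities (illustrative only).
    statistic : function T(x, y); a trained network would play this role.
    """
    n = len(xs)
    # Average of T over the observed (paired) samples, i.e. the joint distribution.
    joint_term = sum(statistic(x, y) for x, y in zip(xs, ys)) / n
    # Pair each x with every other sample's y to mimic the product of marginals.
    mismatched = [
        math.exp(statistic(xs[i], ys[j]))
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    return joint_term - math.log(sum(mismatched) / len(mismatched))


def penalised_loss(task_loss, dependency_estimate, weight=0.1):
    """Task loss minus a weighted dependency estimate (assumed combination).

    Minimising this value rewards representations whose modalities are
    more strongly dependent, on top of fitting the task.
    """
    return task_loss - weight * dependency_estimate


# Toy data: the second modality either copies the first or ignores it.
visual = [-1.0, 1.0, -1.0, 1.0]
text_dependent = [-1.0, 1.0, -1.0, 1.0]
text_unrelated = [-1.0, -1.0, 1.0, 1.0]

product_statistic = lambda x, y: x * y  # hypothetical fixed statistic

dependent_score = dv_lower_bound(visual, text_dependent, product_statistic)
unrelated_score = dv_lower_bound(visual, text_unrelated, product_statistic)
total = penalised_loss(task_loss=1.0, dependency_estimate=dependent_score)
```

In this toy setting the dependent pairing scores higher than the unrelated one, so the penalised loss is lower when the modalities carry shared information. In practice the statistic would be learnt jointly with the fusion model, which is presumably where the statistical network mentioned above comes from.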
