Sequential Late Fusion Technique for Multi-modal Sentiment Analysis
This work addresses sentiment analysis for users of multi-modal systems, but appears incremental as it builds on existing fusion methods without clear breakthrough claims.
The authors tackle multi-modal sentiment analysis by proposing a sequential late fusion technique that applies a multi-head attention LSTM network to text, audio, and visual data from the MOSI dataset. The model is evaluated on a classification task, but the abstract does not report the resulting performance figures.
Multi-modal sentiment analysis plays an important role in providing better interactive experiences to users. Each modality in multi-modal data can offer a different viewpoint or reveal unique aspects of a user's emotional state. In this work, we use the text, audio, and visual modalities from the MOSI dataset and propose a novel fusion technique based on a multi-head attention LSTM network. Finally, we perform a classification task and evaluate the model's performance.
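The abstract does not spell out the architecture, so the following is only a minimal PyTorch sketch of one plausible reading of "sequential late fusion with a multi-head attention LSTM": each modality is encoded by its own LSTM, the three per-modality summaries are stacked into a short sequence, multi-head self-attention is applied across that sequence, and a second LSTM consumes the attended sequence before classification. The class name `SequentialLateFusion`, the layer sizes, the feature dimensions, and the ordering of the modalities are all assumptions made for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn


class SequentialLateFusion(nn.Module):
    """Hypothetical sketch, not the authors' model: encode modalities separately, fuse late."""

    def __init__(self, text_dim, audio_dim, visual_dim,
                 hidden_dim=32, num_heads=4, num_classes=2):
        super().__init__()
        # One LSTM encoder per modality; modalities stay separate until fusion.
        self.encoders = nn.ModuleList([
            nn.LSTM(dim, hidden_dim, batch_first=True)
            for dim in (text_dim, audio_dim, visual_dim)
        ])
        # Multi-head self-attention across the three modality summaries.
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # A second LSTM reads the attended summaries one modality at a time.
        self.fusion_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text, audio, visual):
        summaries = []
        for encoder, features in zip(self.encoders, (text, audio, visual)):
            _, (h_n, _) = encoder(features)      # h_n: (1, batch, hidden_dim)
            summaries.append(h_n[-1])            # (batch, hidden_dim)
        # Assumed order of the fused sequence: text, audio, visual.
        modality_seq = torch.stack(summaries, dim=1)   # (batch, 3, hidden_dim)
        attended, _ = self.attention(modality_seq, modality_seq, modality_seq)
        _, (fused, _) = self.fusion_lstm(attended)
        return self.classifier(fused[-1])        # (batch, num_classes) logits


# Placeholder feature sizes and random inputs; real MOSI features would differ.
model = SequentialLateFusion(text_dim=16, audio_dim=8, visual_dim=12)
logits = model(torch.randn(5, 7, 16), torch.randn(5, 7, 8), torch.randn(5, 7, 12))
print(logits.shape)
```

The usage lines feed a batch of 5 random utterances with 7 time steps per modality and produce one row of class logits per utterance. Whether the paper's fusion operates on utterance-level summaries as here, or on some other unit, cannot be determined from the abstract.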