LG MM Nov 3, 2020

Robust Latent Representations via Cross-Modal Translation and Alignment

Vandana Rajan, Alessio Brutti, Andrea Cavallaro

arXiv:2011.01631v25.813 citations

Originality: Incremental advance

AI Analysis

This work addresses a limitation of multi-modal learning in tasks such as emotion recognition, where some modalities may be missing or noisy at test time. The contribution is incremental, as it builds on existing multi-modal methods.

The paper aims to improve uni-modal testing performance by using multiple modalities during training only. It combines cross-modal translation with correlation-based latent space alignment and reports state-of-the-art uni-modal results for the weaker modalities on the AVEC 2016 dataset for continuous emotion recognition.

Multi-modal learning relates information across observation modalities of the same physical phenomenon to leverage complementary information. Most multi-modal machine learning methods require that all the modalities used for training are also available for testing. This is a limitation when the signals from some modalities are unavailable or are severely degraded by noise. To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only. The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment to improve the representations of the weaker modalities. The translation from the weaker to the stronger modality generates a multi-modal intermediate encoding that is representative of both modalities. This encoding is then correlated with the stronger modality representations in a shared latent space. We validate the proposed approach on the AVEC 2016 dataset for continuous emotion recognition and show the effectiveness of the approach that achieves state-of-the-art (uni-modal) performance for weaker modalities.
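The correlation-based alignment step described in the abstract can be illustrated with a small sketch. The abstract does not specify the exact objective, so the loss below (one minus the mean per-dimension Pearson correlation between two latent batches) is an assumed stand-in, not the authors' formulation. The function name `correlation_alignment_loss` and the arguments `z_weak` and `z_strong` are hypothetical.

```python
import numpy as np


def correlation_alignment_loss(z_weak, z_strong, eps=1e-8):
    """Return 1 - mean per-dimension Pearson correlation of two latent batches.

    Assumption: both inputs have shape (batch, latent_dim). `z_weak` stands for
    the intermediate encoding produced by the weaker-to-stronger translation,
    `z_strong` for the stronger-modality representation. This is an
    illustrative stand-in, not the loss used in the paper.
    """
    z_weak = np.asarray(z_weak, dtype=np.float64)
    z_strong = np.asarray(z_strong, dtype=np.float64)
    if z_weak.shape != z_strong.shape:
        raise ValueError("both batches must have shape (batch, latent_dim)")

    # Centre each latent dimension over the batch.
    a = z_weak - z_weak.mean(axis=0, keepdims=True)
    b = z_strong - z_strong.mean(axis=0, keepdims=True)

    # Pearson correlation for each latent dimension.
    cov = (a * b).sum(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + eps
    corr = cov / denom

    # Perfectly correlated latents give a loss near 0, anti-correlated near 2.
    return float(1.0 - corr.mean())
```

Under this assumed objective, minimising the loss pushes each dimension of the weaker modality's encoding to vary together with the corresponding dimension of the stronger modality's representation. It is invariant to positive scaling and shifts of either input.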
