IR, Dec 5, 2021

Variational Autoencoder with CCA for Audio-Visual Cross-Modal Retrieval

Jiwei Zhang, Yi Yu, Suhua Tang, Jianming Wu, Wei Li

arXiv:2112.02601v16.322 citations

Originality: Incremental advance

AI Analysis

This paper addresses the problem of measuring similarity between different modalities, a concern for researchers in information retrieval and machine learning, and represents an incremental improvement over existing subspace learning approaches.

The paper tackles cross-modal retrieval between audio and visual data by proposing a variational autoencoder architecture with canonical correlation analysis constraints to learn joint embeddings, achieving results that are appreciably better than existing methods on two benchmark datasets.

Cross-modal retrieval uses one modality as a query to retrieve data from another modality, and it has become a popular topic in information retrieval, machine learning, and databases. The major challenge is how to effectively measure the similarity between data of different modalities. Although several research works have calculated the correlation between different modalities by learning a common subspace representation, the encoder's ability to extract features from multi-modal information remains unsatisfactory. In this paper, we present a novel variational autoencoder (VAE) architecture for audio-visual cross-modal retrieval, which learns paired audio-visual correlation embedding and category correlation embedding as constraints to reinforce the mutuality of audio-visual information. On the one hand, an audio encoder and a visual encoder separately encode audio data and visual data into two different latent spaces, and two mutual latent spaces are then constructed by canonical correlation analysis (CCA). On the other hand, probabilistic modeling methods are used to deal with possible noise and missing information in the data. In this way, the cross-modal discrepancies arising from intra-modal and inter-modal information are simultaneously eliminated in the joint embedding subspace. We conduct extensive experiments on two benchmark datasets. The experimental results show that the proposed architecture is effective in learning audio-visual correlation and is appreciably better than existing cross-modal retrieval methods.
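
To make the retrieval step concrete, the following is a minimal sketch of how a query from one modality could be matched against items from another once both have been mapped into a joint embedding space. It is not the authors' implementation: the abstract does not state the similarity measure, so cosine similarity is an assumption here, and the function names (`cosine_similarity`, `retrieve`), the item names, and the three-dimensional embeddings are hypothetical values chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, gallery, top_k=3):
    """Rank gallery items (name -> embedding) by similarity to the query."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in gallery.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings: in a real system these would come from the
# trained audio and visual encoders, not from hand-written numbers.
audio_query = [0.9, 0.1, 0.0]
visual_gallery = {
    "video_guitar": [0.8, 0.2, 0.1],
    "video_drums": [0.1, 0.9, 0.2],
    "video_piano": [0.0, 0.2, 0.9],
}

ranking = retrieve(audio_query, visual_gallery, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors, the audio query ranks `video_guitar` first because its embedding points in nearly the same direction. The quality of such a ranking depends entirely on how well the encoders align the two modalities, which is the part the paper's VAE and CCA constraints are meant to improve.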
