IR MM Aug 21, 2019

Learning Joint Embedding for Cross-Modal Retrieval

arXiv:1908.07673v16.68 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses cross-modal retrieval challenges for multimedia data mining, but it appears incremental as it builds on existing correlation learning methods.

The paper tackles cross-modal retrieval by addressing the gap in temporal structures between different data modalities. It proposes a supervised correlation learning architecture based on triplet neural networks, which achieves its best results when the data representations are extracted by supervised learning.

Cross-modal retrieval uses a query in one modality to obtain relevant data in another modality. The main challenge lies in bridging the heterogeneous gap for similarity computation, which has been broadly discussed in image-text, audio-text, and video-text cross-modal multimedia data mining and retrieval. However, the gap in temporal structures between different data modalities is not well addressed, owing to the lack of alignment relationships between temporal cross-modal structures. Our research focuses on learning the correlation between different modalities for the task of cross-modal retrieval. We have proposed an architecture, Supervised-Deep Canonical Correlation Analysis (S-DCCA), for cross-modal retrieval. In this forum paper, we discuss how to exploit triplet neural networks (TNN) to enhance correlation learning for cross-modal retrieval. The experimental results show that the proposed TNN-based supervised correlation learning architecture achieves the best results when the data representations are extracted by supervised learning.
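The idea behind triplet-based correlation learning can be illustrated with a generic triplet margin loss. The following is a minimal sketch, not the paper's implementation: the Euclidean distance, the margin value, and the function names `euclidean`, `triplet_loss`, and `retrieve` are all assumptions chosen for illustration, and the embeddings are assumed to have already been projected into a shared space.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss (assumed form, not taken from the paper).

    anchor:   embedding of the query item in one modality
    positive: embedding of a matching item in the other modality
    negative: embedding of a non-matching item in the other modality

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def retrieve(query, candidates):
    """Rank candidate indices by distance to the query, nearest first."""
    return sorted(range(len(candidates)), key=lambda i: euclidean(query, candidates[i]))


if __name__ == "__main__":
    query = [0.0, 0.0]       # hypothetical query embedding
    match = [0.0, 1.0]       # hypothetical matching item in the other modality
    non_match = [3.0, 4.0]   # hypothetical non-matching item

    print(triplet_loss(query, match, non_match))
    print(retrieve(query, [non_match, match]))
```

Minimizing a loss of this kind over many triplets pulls matching cross-modal pairs together and pushes non-matching pairs apart, so that nearest-neighbor ranking in the shared space can serve as the retrieval step.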
