MM · Oct 31, 2018

Semantic Modeling of Textual Relationships in Cross-Modal Retrieval

Jing Yu, Chenghao Yang, Zengchang Qin, Zhuoqian Yang, Yue Hu, Weifeng Zhang

arXiv:1810.13151v36.66 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of improving cross-modal retrieval accuracy for applications like multimedia search, representing an incremental advance over existing methods.

The paper models textual relationships for cross-modal retrieval by integrating multi-view textual relationships into a featured graph and training a dual-path neural network to learn representations and a similarity measure jointly. It reports accuracy gains of 3.4% and 6.3% over state-of-the-art models on two benchmark datasets.

Feature modeling of different modalities is a basic problem in current research on cross-modal information retrieval. Existing models typically project texts and images into one embedding space, in which semantically similar information will have a shorter distance. Semantic modeling of textual relationships is notoriously difficult. In this paper, we propose an approach to model texts using a featured graph by integrating multi-view textual relationships including semantic relations, statistical co-occurrence, and prior relations in the knowledge base. A dual-path neural network is adopted to learn multi-modal representations of information and a cross-modal similarity measure jointly. We use a Graph Convolutional Network (GCN) for generating relation-aware text representations, and use a Convolutional Neural Network (CNN) with non-linearities for image representations. The cross-modal similarity measure is learned by distance metric learning. Experimental results show that, by leveraging the rich relational semantics in texts, our model can outperform the state-of-the-art models by 3.4% and 6.3% in accuracy on two benchmark datasets.
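As a rough illustration of the text path described in the abstract, the sketch below runs one graph-convolution step over a small word graph and compares the pooled result with an image vector. It is not the authors' implementation. The propagation rule (symmetric normalization with self-loops), the mean pooling, the cosine similarity, and all function names are assumptions made for this example; the abstract does not specify the GCN variant, and it states that the similarity measure is learned rather than fixed.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adj    : n x n adjacency matrix of the word graph (assumed symmetric)
    feats  : n x d node feature matrix
    weight : d x h weight matrix (learned in practice, fixed here)
    """
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Symmetric degree normalization.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    a_norm = [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]
    # Aggregate neighbor features, apply the linear map, then ReLU.
    out = matmul(matmul(a_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def mean_pool(node_feats):
    """Average node features into one text vector (an assumed pooling choice)."""
    n = len(node_feats)
    return [sum(row[j] for row in node_feats) / n for j in range(len(node_feats[0]))]


def cosine_similarity(u, v):
    """Cosine similarity, standing in for the learned similarity measure."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Toy word graph: words 0 and 1 are related, word 2 is isolated.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, for readability

text_vec = mean_pool(gcn_layer(adj, feats, weight))
image_vec = [1.0, 1.0]  # hypothetical image embedding from the CNN path
score = cosine_similarity(text_vec, image_vec)
```

In this toy graph the two related words end up with identical features after the layer, which is the sense in which the text representation becomes relation-aware. In a trained model the weight matrix and the similarity function would both be learned from matched and mismatched text-image pairs.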

