CV | Oct 16, 2018

Cross-Modal and Hierarchical Modeling of Video and Text

arXiv:1810.07212v125.4208 citations | h-index: 63 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of cross-modal retrieval and understanding for video and text data, representing an incremental improvement in hierarchical modeling techniques.

The paper tackles the problem of modeling hierarchical sequential data across modalities. It introduces hierarchical sequence embedding (HSE), which embeds video and text into hierarchical semantic spaces, reports superior performance on large-scale video and paragraph retrieval datasets, and applies the learned embeddings to zero-shot action recognition and video captioning.

Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, where each depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (HSE), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate the superior performance of the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks. We show their utility in zero-shot action recognition and video captioning.
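To make the two-level idea in the abstract concrete, here is a minimal sketch of hierarchical embedding followed by cross-modal retrieval. It is not the authors' HSE model: mean pooling stands in for the learned sequence encoders, both modalities are assumed to already share one feature space, and the function names (`mean_pool`, `embed_hierarchy`, `retrieve`) and toy vectors are hypothetical.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed_hierarchy(segments):
    """Embed a two-level sequence.

    `segments` is a list of low-level units (clips or sentences), each a
    list of feature vectors (frames or words). Returns the low-level
    embeddings and the single high-level (video or paragraph) embedding.
    Mean pooling is an assumption standing in for a learned encoder.
    """
    low_level = [mean_pool(segment) for segment in segments]
    return low_level, mean_pool(low_level)


def retrieve(query, candidates):
    """Return the index of the candidate most similar to the query."""
    scores = [cosine(query, candidate) for candidate in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-D features: two videos (lists of clips) and one paragraph (list of sentences).
video_a = [[[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1]]]
video_b = [[[0.0, 1.0]], [[0.1, 0.9], [0.2, 0.8]]]
paragraph = [[[1.0, 0.1]], [[0.7, 0.0]]]

_, video_a_embedding = embed_hierarchy(video_a)
_, video_b_embedding = embed_hierarchy(video_b)
_, paragraph_embedding = embed_hierarchy(paragraph)

best_match = retrieve(paragraph_embedding, [video_a_embedding, video_b_embedding])
```

In this sketch the low-level embeddings (clips, sentences) are returned alongside the high-level one, so similarity could also be computed at the finer granularity when clip-sentence correspondences are available.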
