CV · Oct 16, 2018

Cross-Modal and Hierarchical Modeling of Video and Text

arXiv:1810.07212v1 · 208 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of cross-modal retrieval and understanding for video and text data, representing an incremental improvement in hierarchical modeling techniques.

The paper models hierarchical sequential data across modalities by introducing hierarchical sequence embedding (HSE), a generic model that embeds video and text into hierarchically semantic spaces using explicit or implicit correspondence information. The authors report superior performance on large-scale video and paragraph retrieval datasets and show that the learned embeddings are useful for zero-shot action recognition and video captioning.

Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, where each depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate the modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (HSE), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate superior performance by the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks. We show their utility in zero-shot action recognition and video captioning.
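To make the two-level structure described in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's model: mean pooling stands in for whatever learned sequence encoders HSE uses, and the function names (`mean_pool`, `embed_hierarchy`, `cosine`, `retrieve`) and the toy 2-dimensional features are hypothetical, chosen only for illustration. The sketch shows how a nested sequence (frames within clips, or words within sentences) yields both segment-level and global embeddings, and how the global embeddings could drive cross-modal retrieval once both modalities share a space.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors (stand-in for a learned encoder)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def embed_hierarchy(segments):
    """Embed a nested sequence at two granularities.

    segments: list of segments, each a list of feature vectors
    (frames in a clip, or words in a sentence).
    Returns (segment_embeddings, global_embedding).
    """
    segment_embs = [mean_pool(seg) for seg in segments]  # clip / sentence level
    global_emb = mean_pool(segment_embs)                 # video / paragraph level
    return segment_embs, global_emb


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate most similar to the query."""
    return max(range(len(candidate_embs)),
               key=lambda i: cosine(query_emb, candidate_embs[i]))


# Toy data (hypothetical): one video with two clips, two candidate paragraphs.
video = [[[1.0, 0.0], [1.0, 0.0]], [[0.8, 0.2]]]
paragraph_a = [[[0.0, 1.0]]]
paragraph_b = [[[1.0, 0.1]], [[0.9, 0.0]]]

clip_embs, video_emb = embed_hierarchy(video)
_, emb_a = embed_hierarchy(paragraph_a)
_, emb_b = embed_hierarchy(paragraph_b)
best = retrieve(video_emb, [emb_a, emb_b])
```

In a trained model the segment-level embeddings would also be compared across modalities (clip to sentence) where correspondence information is available; the sketch only ranks at the global level.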

Code Implementations: 1 repo
