CV · Aug 9, 2019

Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

arXiv:1908.03477v10.00187 citations
AI Analysis: 50

This work addresses fine-grained action retrieval for video-text applications, offering incremental improvements through specialized embedding spaces.

The paper tackles cross-modal fine-grained action retrieval between text and video by enriching embeddings through disentangling parts-of-speech in captions, reporting improved results on the EPIC dataset in a zero-shot setting and benefits on MSR-VTT for generic retrieval.

We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities. We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset.
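
The structure described in the abstract (one embedding space per PoS tag, feeding an integrated space, trained with PoS-aware plus PoS-agnostic losses) can be illustrated with a toy sketch. This is not the authors' implementation. The two tags (verb, noun), the plain linear projections, the squared-distance triplet loss, the dimensions, and every function name below are assumptions chosen only to make the idea concrete.

```python
import math
import random

POS_TAGS = ("verb", "noun")  # assumption: two PoS tags, for illustration only


def linear(x, w):
    """Apply a bias-free linear map; w is a list of rows (out_dim x in_dim)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v)) or 1.0  # guard against a zero vector
    return [a / norm for a in v]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on squared distances: pull the positive closer than the negative."""
    return max(0.0, margin + sq_dist(anchor, positive) - sq_dist(anchor, negative))


def random_matrix(rng, out_dim, in_dim):
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def make_params(rng, video_dim, text_dim, pos_dim, joint_dim):
    """One projection per modality and PoS tag, plus one integrated projection."""
    params = {"video": {}, "text": {}}
    for tag in POS_TAGS:
        params["video"][tag] = random_matrix(rng, pos_dim, video_dim)
        params["text"][tag] = random_matrix(rng, pos_dim, text_dim)
    fused_dim = pos_dim * len(POS_TAGS)
    params["video"]["joint"] = random_matrix(rng, joint_dim, fused_dim)
    params["text"]["joint"] = random_matrix(rng, joint_dim, fused_dim)
    return params


def embed(modality_params, feats_by_tag):
    """Embed into each PoS space, then feed the concatenation to the joint space."""
    per_pos = {
        tag: l2_normalize(linear(feats_by_tag[tag], modality_params[tag]))
        for tag in POS_TAGS
    }
    fused = [x for tag in POS_TAGS for x in per_pos[tag]]
    joint = l2_normalize(linear(fused, modality_params["joint"]))
    return per_pos, joint


def joint_loss(params, video_feat, pos_caption, neg_caption, margin=1.0):
    """Sum of per-tag (PoS-aware) losses and one integrated (PoS-agnostic) loss."""
    # Assumption: the same video feature is projected into every PoS space.
    v_pos, v_joint = embed(params["video"], {t: video_feat for t in POS_TAGS})
    p_pos, p_joint = embed(params["text"], pos_caption)
    n_pos, n_joint = embed(params["text"], neg_caption)
    pos_aware = sum(
        triplet_loss(v_pos[t], p_pos[t], n_pos[t], margin) for t in POS_TAGS
    )
    pos_agnostic = triplet_loss(v_joint, p_joint, n_joint, margin)
    return pos_aware + pos_agnostic


rng = random.Random(0)
params = make_params(rng, video_dim=6, text_dim=4, pos_dim=3, joint_dim=5)
video = [rng.uniform(-1, 1) for _ in range(6)]
matching = {t: [rng.uniform(-1, 1) for _ in range(4)] for t in POS_TAGS}
other = {t: [rng.uniform(-1, 1) for _ in range(4)] for t in POS_TAGS}
print(joint_loss(params, video, matching, other))
```

In this sketch retrieval would rank captions by distance to the video in the integrated space, while the per-tag spaces offer separate views of the same video and caption. The weights are random and untrained, so the printed loss only demonstrates the data flow.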
