Deep Learning for Video-Text Retrieval: a Review
This survey provides a comprehensive overview for researchers in video-text retrieval, but its contribution is incremental: it synthesizes existing work without introducing new methods.
This survey reviews over 100 research papers on video-text retrieval, summarizing state-of-the-art performance on benchmark datasets and discussing challenges like learning spatial-temporal video features and narrowing the cross-modal gap.
Video-Text Retrieval (VTR) aims to find the video most relevant to the semantics of a given sentence, and vice versa. In general, this retrieval task is composed of four successive steps: video feature representation extraction, textual feature representation extraction, feature embedding and matching, and objective functions. Finally, a list of samples retrieved from the dataset is ranked by matching similarity to the query. In recent years, deep learning techniques have brought significant progress. However, VTR remains a challenging task because of open problems such as how to learn an efficient spatial-temporal video feature and how to narrow the cross-modal gap. In this survey, we review and summarize over 100 research papers related to VTR, report state-of-the-art performance on several commonly used benchmark datasets, and discuss potential challenges and directions, in the hope of providing some insights for researchers in the field of video-text retrieval.
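The final ranking step described above can be illustrated with a minimal sketch. It assumes that videos and the text query have already been mapped into a shared embedding space and that cosine similarity is the matching score; both are common choices but are assumptions here, not something the abstract specifies. The function names, video identifiers, and toy vectors are hypothetical and stand in for the outputs of real video and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_embedding, candidate_embeddings):
    """Return (candidate_id, score) pairs sorted from most to least similar."""
    scored = [
        (candidate_id, cosine_similarity(query_embedding, embedding))
        for candidate_id, embedding in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Hypothetical toy embeddings; a real system would obtain these from
# learned video and text encoders that project into a joint space.
video_embeddings = {
    "video_a": [0.9, 0.1, 0.0],
    "video_b": [0.1, 0.9, 0.2],
    "video_c": [0.5, 0.5, 0.5],
}
text_query_embedding = [1.0, 0.2, 0.0]

# Text-to-video retrieval: rank every video against the sentence query.
ranking = rank_candidates(text_query_embedding, video_embeddings)
for video_id, score in ranking:
    print(f"{video_id}: {score:.3f}")
```

Video-to-text retrieval would work the same way with the roles swapped: a video embedding serves as the query and the candidate dictionary holds sentence embeddings.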