CV · Mar 25, 2023

Learning video embedding space with Natural Language Supervision

arXiv:2303.14584v2 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the gap in video-language alignment for tasks like video retrieval and classification, but it is incremental as it builds directly on CLIP.

The paper tackles the problem of extending CLIP's image-language embedding space to videos and reports state-of-the-art performance on the UCF101 and HMDB51 datasets.

The recent success of the CLIP model has shown its potential to be applied to a wide range of vision and language tasks. However, this only establishes an embedding-space relationship between language and images, not between language and the video domain. In this paper, we propose a novel approach to map the video embedding space to natural language. We propose a two-stage approach that first extracts visual features from each frame of a video using a pre-trained CNN, and then uses the CLIP model to encode the visual features for the video domain, along with the corresponding text descriptions. We evaluate our method on two benchmark datasets, UCF101 and HMDB51, and achieve state-of-the-art performance on both datasets.
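The abstract does not say how per-frame features are combined or how a video is matched to text, so the following is only a minimal sketch of one common way such a pipeline could be used for classification. It assumes mean pooling over frames, cosine similarity against one text embedding per class, and a softmax temperature of 0.07. All function names, the toy random vectors, and the three class labels are hypothetical stand-ins for real frame-encoder and text-encoder outputs, not the authors' implementation.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector (assumed pooling)."""
    n = len(frame_features)
    return [sum(col) / n for col in zip(*frame_features)]


def cosine_scores(video_vec, text_vecs):
    """Cosine similarity between one video vector and each text vector."""
    v = l2_normalize(video_vec)
    return [sum(a * b for a, b in zip(v, l2_normalize(t))) for t in text_vecs]


def softmax(scores, temperature=0.07):
    """Turn similarity scores into probabilities; the temperature is an assumed value."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classify_video(frame_features, text_vecs, labels):
    """Return the label whose text embedding is closest to the pooled video vector."""
    probs = softmax(cosine_scores(mean_pool(frame_features), text_vecs))
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy stand-ins: real inputs would come from a frame encoder and a text encoder.
rng = random.Random(0)
labels = ["archery", "bowling", "surfing"]  # illustrative class names
text_vecs = [[rng.gauss(0, 1) for _ in range(8)] for _ in labels]
# Frames are noisy copies of the "bowling" text vector, so that label should win.
frames = [[x + rng.gauss(0, 0.1) for x in text_vecs[1]] for _ in range(16)]
label, probs = classify_video(frames, text_vecs, labels)
print(label, [round(p, 3) for p in probs])
```

Mean pooling ignores frame order, which keeps the sketch short; a temporal model over the frame features would be a natural alternative if ordering matters for the action classes.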

