CV | Sep 18, 2023

Unsupervised Open-Vocabulary Object Localization in Videos

ETH Zurich
arXiv:2309.09858v2 | 14 citations | h-index: 137
Originality: Incremental advance
AI Analysis

This paper addresses object localization in videos, a problem of interest to computer vision researchers. Its unsupervised approach is novel but incremental, since it builds on existing models.

The paper tackles unsupervised object localization in videos by combining slot attention with CLIP to assign text to localized objects, achieving good results on standard benchmarks without explicit supervision.

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
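The second step described above, assigning text to slots, can be illustrated with a minimal sketch. It assumes each slot has already been mapped to a feature vector in the same embedding space as the text embeddings. How the paper actually reads localized semantic information out of CLIP is not described here, so this is a generic nearest-label lookup by cosine similarity and not the authors' procedure. The function name `assign_labels` and the toy vectors are hypothetical stand-ins for real slot features and CLIP text embeddings.

```python
import numpy as np


def assign_labels(slot_features, text_embeddings, labels):
    """For each slot, pick the label whose text embedding is most similar.

    Hypothetical helper: assumes slot_features (num_slots, dim) and
    text_embeddings (num_labels, dim) live in a shared embedding space.
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    s = slot_features / np.linalg.norm(slot_features, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    similarity = s @ t.T  # shape: (num_slots, num_labels)
    best = similarity.argmax(axis=1)
    return [labels[i] for i in best], similarity


# Toy stand-ins for text embeddings of three candidate labels.
labels = ["dog", "car", "person"]
text_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Toy stand-ins for features of two localized slots.
slot_features = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 2.0],
])

assigned, similarity = assign_labels(slot_features, text_embeddings, labels)
print(assigned)
```

In a real pipeline the candidate labels would come from an open vocabulary and both sets of vectors from trained models. The sketch only shows the matching step.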

Code Implementations: 1 repo
