CVAICLLGNov 1, 2020

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

arXiv:2011.00597v1180 citationsHas Code
AI Analysis

This work addresses video-text tasks for applications like retrieval and captioning, offering an incremental improvement by leveraging hierarchical information.

The paper tackles the problem of video-text representation learning by proposing COOT, a cooperative hierarchical Transformer that models interactions across different granularity levels and modalities, achieving state-of-the-art results on several benchmarks with few parameters.

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clip and sentences or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g. clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes