CV · Aug 3, 2022

A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval

arXiv:2208.02080v1 · 32 citations · h-index: 29 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the resource demands and copyright limitations that affect video retrieval on social media and content platforms. The contribution is incremental, since it builds on existing data augmentation methods.

The paper tackles the problem of data augmentation for text-video retrieval by proposing a feature-space multimodal technique that mixes semantically similar samples, achieving improved state-of-the-art performance on the EPIC-Kitchens-100 dataset.

Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increased attention over the past few years. Data augmentation techniques were introduced to increase performance on unseen test examples by creating new training samples through semantics-preserving transformations, such as color space or geometric transformations on images. Yet these techniques are usually applied to raw data, which makes them more resource-demanding and requires that the raw data be shareable; this is not always the case, e.g. because of copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We evaluate our solution on a large-scale public dataset, EPIC-Kitchens-100, and achieve considerable improvements over a baseline method and improved state-of-the-art performance, and we also perform multiple ablation studies. We release code and pretrained models on GitHub at https://github.com/aranciokov/FSMMDA_VideoRetrieval.
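The abstract does not spell out the mixing rule, so the following is only a minimal sketch of what feature-space mixing of semantically similar samples could look like, not the authors' implementation. It assumes that "semantically similar" means sharing a class label, that mixing is a mixup-style convex combination with a coefficient drawn from a Beta distribution, and that the same coefficient is applied to the video and caption features of a pair. The function name `mix_features` and the parameter `alpha` are hypothetical.

```python
import numpy as np


def mix_features(video_feats, text_feats, labels, alpha=1.0, rng=None):
    """Create augmented (video, caption) feature pairs by mixing each sample
    with a randomly chosen partner that has the same label.

    Assumptions (not taken from the paper): similarity = equal label,
    mixing = convex combination, lambda ~ Beta(alpha, alpha).
    """
    rng = np.random.default_rng() if rng is None else rng
    video_feats = np.asarray(video_feats, dtype=float)
    text_feats = np.asarray(text_feats, dtype=float)
    labels = np.asarray(labels)

    new_video = video_feats.copy()
    new_text = text_feats.copy()
    indices = np.arange(len(labels))

    for i, label in enumerate(labels):
        # Candidate partners: other samples carrying the same label.
        candidates = indices[(labels == label) & (indices != i)]
        if candidates.size == 0:
            continue  # no similar sample available: keep the original pair
        j = rng.choice(candidates)
        lam = rng.beta(alpha, alpha)
        # Use the same coefficient in both modalities so that the new
        # video and the new caption stay aligned with each other.
        new_video[i] = lam * video_feats[i] + (1.0 - lam) * video_feats[j]
        new_text[i] = lam * text_feats[i] + (1.0 - lam) * text_feats[j]

    return new_video, new_text
```

Because the function only touches precomputed feature vectors, a sketch like this needs neither the raw clips nor a video decoder, which is the practical advantage the abstract points to.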

Code Implementations: 1 repo
