CV · Nov 17, 2022

Exploring adaptation of VideoMAE for Audio-Visual Diarization & Social @ Ego4d Looking at me Challenge

arXiv:2211.16206v1 · h-index: 27
Originality Synthesis-oriented
AI Analysis

This work addresses audio-visual diarization in egocentric video analysis, but it is incremental as it applies an existing method to a new dataset.

The authors adapted VideoMAE, a pretrained video masked autoencoder, for the Ego4D Looking at Me Challenge, achieving better results than the baseline after training for only 10 epochs on egocentric data.

In this report, we present our transfer of pretrained video masked autoencoders (VideoMAE) to egocentric tasks for the Ego4D Looking at Me Challenge. VideoMAE is a data-efficient model for self-supervised video pre-training that transfers easily to downstream tasks. We show that the representation transferred from VideoMAE has good spatio-temporal modeling and the ability to capture small actions. Starting from VideoMAE pretrained on ordinary videos captured from a third-person view, we only need to train for 10 epochs on egocentric data to obtain better results than the baseline on the Ego4D Looking at Me Challenge.

