CV · Jun 20, 2022

M&M Mix: A Multimodal Multiview Transformer Ensemble

arXiv:2206.09852v1 · 23 citations · h-index: 151
Originality: Synthesis-oriented
AI Analysis

This work addresses action recognition in videos, a computer vision task. It is incremental: it builds on the authors' existing Multiview Transformer for Video Recognition (MTV) and extends it to multimodal inputs.

The authors tackled the Epic-Kitchens Action Recognition Challenge by adapting a Multiview Transformer to multimodal inputs and using an ensemble of models, achieving 52.8% Top-1 accuracy, a 4.1% improvement over the previous winner.

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy for action classes on the test set, which is 4.1% higher than last year's winning entry.
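
The abstract does not say how the ensemble members' predictions are combined. As a rough illustration only, the sketch below assumes one common choice: averaging each model's softmax probabilities and taking the highest-scoring class. The function names (`softmax`, `ensemble_top1`, `top1_accuracy`) and the toy scores are hypothetical and are not taken from the report.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw class scores into probabilities (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_top1(per_model_logits: Sequence[Sequence[float]]) -> int:
    """Average class probabilities over models and return the top-1 class index.

    Assumption: simple unweighted averaging. The report may use a different
    combination rule (for example, weighted averaging).
    """
    if not per_model_logits:
        raise ValueError("need at least one model's scores")
    num_classes = len(per_model_logits[0])
    if any(len(scores) != num_classes for scores in per_model_logits):
        raise ValueError("all models must score the same set of classes")
    probs = [softmax(scores) for scores in per_model_logits]
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return max(range(num_classes), key=mean.__getitem__)


def top1_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of examples whose predicted class equals the label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example: three hypothetical models scoring one clip over three classes.
clip_scores = [
    [2.0, 1.0, 0.0],  # model A prefers class 0
    [0.0, 3.0, 0.0],  # model B prefers class 1
    [0.0, 2.5, 0.0],  # model C prefers class 1
]
predicted_class = ensemble_top1(clip_scores)
```

In this toy case the two models that agree on class 1 outweigh the single model that prefers class 0, which is the basic effect an ensemble of differently sized or differently fed models relies on.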

