M&M Mix: A Multimodal Multiview Transformer Ensemble
This work addresses action recognition in video. Its contribution is incremental: it extends the authors' existing Multiview Transformer (MTV) to multimodal inputs and ensembles the resulting models.
The authors tackle the 2022 Epic-Kitchens Action Recognition Challenge by adapting their Multiview Transformer (MTV) to multimodal inputs and ensembling models with varying backbone sizes and input modalities, achieving 52.8% Top-1 accuracy on action classes, 4.1% higher than the previous year's winning entry.
This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on action classes on the test set, which is 4.1% higher than last year's winning entry.
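The abstract does not say how the ensemble members are combined. As a minimal sketch, assuming the common choice of averaging per-model softmax probabilities before taking the top-scoring class, the idea could look like the following. The function names `softmax`, `ensemble_predict`, and `top1_accuracy` are hypothetical and are not taken from the M&M or MTV code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits: Sequence[Sequence[float]]) -> int:
    """Average class probabilities over models and return the top-1 class index.

    Assumption: uniform averaging of softmax outputs. The report may use a
    different combination rule (for example, weighted averaging).
    """
    probs = [softmax(logits) for logits in per_model_logits]
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return max(range(num_classes), key=mean.__getitem__)


def top1_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of examples whose predicted class matches the label."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example: three hypothetical models scoring one clip over three classes.
clip_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.5, 0.0]]
predicted_class = ensemble_predict(clip_logits)
```

In this toy example the first model alone would pick class 0, but the averaged probabilities favor class 1, which illustrates how an ensemble can override a single member's prediction.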