CV · AI · Aug 30, 2021

Searching for Two-Stream Models in Multivariate Space for Video Recognition

arXiv:2108.12957v1 · 9 citations
Originality: Highly original
AI Analysis

Manually designing architectures for video recognition models is time-consuming and often yields sub-optimal results. This work offers an automated alternative that improves both efficiency and accuracy for researchers and practitioners in computer vision.

The authors replace manual design of two-stream video recognition models with a neural architecture search approach that efficiently explores a large design space. The resulting Auto-TSNet models match the accuracy of SlowFast on Kinetics (78.9%) with nearly 11 times fewer FLOPS, and on Something-Something-V2 they improve accuracy by at least 2% over other methods that use less than 50 GFLOPS per video.

Conventional video models rely on a single stream to capture complex spatial-temporal features. Recent two-stream video models, such as the SlowFast network and AssembleNet, prescribe separate streams to learn complementary features and achieve stronger performance. However, manually designing both streams as well as the fusion blocks between them is a daunting task that requires exploring a tremendously large design space. Such manual exploration is time-consuming and often ends up with sub-optimal architectures when computational resources are limited and the exploration is insufficient. In this work, we present a pragmatic neural architecture search approach that can efficiently search for two-stream video models in giant spaces. We design a multivariate search space with 6 search variables to capture a wide variety of choices in designing two-stream models. Furthermore, we propose a progressive search procedure that searches for the architecture of individual streams, fusion blocks, and attention blocks one after the other. We demonstrate that two-stream models with significantly better performance can be automatically discovered in our design space. Our searched two-stream models, namely Auto-TSNet, consistently outperform other models on standard benchmarks. On Kinetics, compared with the SlowFast model, our Auto-TSNet-L model reduces FLOPS by nearly 11 times while achieving the same accuracy of 78.9%. On Something-Something-V2, Auto-TSNet-M improves the accuracy by at least 2% over other methods that use less than 50 GFLOPS per video.
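The progressive search procedure described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the stage names, the option labels, and the additive `TOY_SCORE` table are hypothetical placeholders, not the paper's six search variables, and `evaluate` stands in for training and validating a candidate. The point is only to show why fixing one component at a time needs far fewer evaluations than enumerating the joint space.

```python
# Hypothetical staged search space. The option labels are placeholders,
# not the search variables defined in the paper.
STAGES = [
    ("stream_a", ["a0", "a1", "a2"]),
    ("stream_b", ["b0", "b1", "b2"]),
    ("fusion", ["f0", "f1"]),
    ("attention", ["none", "att0"]),
]

# Toy proxy scores (assumed). A real search would train and validate
# each candidate instead of looking up a number.
TOY_SCORE = {
    "a0": 0.1, "a1": 0.3, "a2": 0.2,
    "b0": 0.2, "b1": 0.1, "b2": 0.4,
    "f0": 0.0, "f1": 0.2,
    "none": 0.0, "att0": 0.1,
}


def evaluate(config):
    """Score a (possibly partial) configuration with the toy proxy."""
    return sum(TOY_SCORE[choice] for choice in config.values())


def progressive_search(stages, evaluate_fn):
    """Pick the best option for each stage in order, keeping earlier picks fixed."""
    chosen = {}
    n_evals = 0
    for name, options in stages:
        best_option, best_score = None, float("-inf")
        for option in options:
            candidate = {**chosen, name: option}
            score = evaluate_fn(candidate)
            n_evals += 1
            if score > best_score:
                best_option, best_score = option, score
        chosen[name] = best_option
    return chosen, n_evals


def joint_space_size(stages):
    """Number of candidates an exhaustive joint search would have to evaluate."""
    size = 1
    for _, options in stages:
        size *= len(options)
    return size


if __name__ == "__main__":
    best, n_evals = progressive_search(STAGES, evaluate)
    print("chosen:", best)
    print("evaluations:", n_evals, "vs joint space:", joint_space_size(STAGES))
```

In this toy setup the staged search evaluates the sum of the per-stage option counts, while an exhaustive search would evaluate their product. Because the toy score is additive, the greedy stage-by-stage choice also happens to be optimal here; with real models, interactions between streams, fusion, and attention mean a staged search trades some optimality for tractability.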
