CV · Dec 10, 2020

Interactive Fusion of Multi-level Features for Compositional Activity Recognition

arXiv:2012.05689v1 · 19 citations
AI Analysis

This work provides an incremental improvement in compositional activity recognition for computer vision researchers, specifically tackling the fusion of diverse feature types.

This paper addresses the challenge of fusing multi-modal and multi-dimensional features (appearance, positional, semantic) for complex action recognition. The proposed interactive fusion framework, which projects features across different spaces and guides the fusion with an auxiliary prediction task, achieved a 2.9% gain in top-1 accuracy on the Something-Else dataset.

To understand a complex action, multiple sources of information, including appearance, positional, and semantic features, need to be integrated. However, these features are difficult to fuse because they often differ significantly in modality and dimensionality. In this paper, we present a novel framework that accomplishes this goal through interactive fusion, that is, by projecting features across different spaces and guiding the fusion with an auxiliary prediction task. Specifically, we implement the framework in three steps: positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction. We evaluate our approach on two action recognition datasets, Something-Something and Charades. Interactive fusion achieves consistent accuracy gains over off-the-shelf action recognition algorithms. In particular, on Something-Else, the compositional setting of Something-Something, interactive fusion yields a gain of 2.9% in top-1 accuracy.
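The general idea of fusing features that differ in dimensionality, with an auxiliary prediction task providing extra guidance, can be illustrated with a toy sketch. This is not the authors' implementation: the feature sizes, the random linear projections (standing in for learned layers), the element-wise sum, and the mean-squared-error auxiliary loss are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import random


def make_projection(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) matrix; a stand-in for a learned linear layer."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vector):
    """Matrix-vector product: maps a feature into another space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse(features, projections):
    """Project every feature into the shared space, then sum element-wise."""
    projected = [project(projections[name], feat)
                 for name, feat in features.items()]
    return [sum(values) for values in zip(*projected)]


def auxiliary_loss(fused, target, head):
    """Mean squared error of predicting `target` back from the fused vector."""
    predicted = project(head, fused)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


rng = random.Random(0)

# Hypothetical feature sizes; the three sources differ in dimensionality.
dims = {"appearance": 8, "positional": 4, "semantic": 6}
shared_dim = 5

projections = {name: make_projection(d, shared_dim, rng)
               for name, d in dims.items()}
# Auxiliary head: shared space -> positional space (assumed target).
aux_head = make_projection(shared_dim, dims["positional"], rng)

features = {name: [rng.random() for _ in range(d)]
            for name, d in dims.items()}

fused = fuse(features, projections)
aux = auxiliary_loss(fused, features["positional"], aux_head)

print("fused vector length:", len(fused))
print("auxiliary loss:", aux)
```

In a trained model the auxiliary loss would typically be added, with some weight, to the main classification loss, so that the fused representation is encouraged to retain information useful for the auxiliary target.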

Code Implementations: 1 repo