Rethinking matching-based few-shot action recognition
Aimed at computer vision researchers, this work addresses the problem of recognizing new action classes from only a few examples and offers an incremental improvement by simplifying the matching process.
The paper tackles few-shot action recognition by evaluating matching-based approaches on features from spatio-temporal backbones, showing that the performance gap between simple baselines and more complex methods is significantly reduced. It then proposes Chamfer++, a non-temporal matching function that achieves state-of-the-art one-shot results on three common datasets.
Few-shot action recognition, i.e. recognizing new action classes given only a few examples, benefits from incorporating temporal information. Prior work either encodes such information in the representation itself and learns classifiers at test time, or obtains frame-level features and performs pairwise temporal matching. We first evaluate a number of matching-based approaches using features from spatio-temporal backbones, a comparison missing from the literature, and show that the gap in performance between simple baselines and more complicated methods is significantly reduced. Inspired by this, we propose Chamfer++, a non-temporal matching function that achieves state-of-the-art results in few-shot action recognition. We show that, when starting from temporal features, our parameter-free and interpretable approach can outperform all other matching-based and classifier methods for one-shot action recognition on three common datasets without using temporal information in the matching stage. Project page: https://jbertrand89.github.io/matching-based-fsar
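To make the idea of non-temporal matching concrete, the following is a minimal sketch of classical one-directional Chamfer similarity between two sets of feature vectors. It is not the Chamfer++ function proposed here, whose definition is not given in this abstract. The function names, the use of cosine similarity, the toy two-dimensional features, and the class labels "jump" and "run" are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def chamfer_similarity(query, support):
    """Classical one-directional Chamfer similarity (illustrative only).

    Each query feature is matched to its most similar support feature,
    and the best-match scores are averaged. The order of features in
    either video is never used, so the matching is non-temporal.
    """
    best = [max(cosine(q, s) for s in support) for q in query]
    return sum(best) / len(best)


def classify_one_shot(query, supports):
    """Return the class whose single support video best matches the query.

    `supports` maps a class name to one list of feature vectors
    (hypothetical layout for a one-shot episode).
    """
    return max(supports, key=lambda name: chamfer_similarity(query, supports[name]))


# Toy episode: two features per video, two hypothetical classes.
query = [[1.0, 0.0], [0.0, 1.0]]
supports = {
    "jump": [[0.0, 1.0], [1.0, 0.0]],  # same features as the query, reversed order
    "run": [[1.0, 1.0], [1.0, 1.0]],
}
prediction = classify_one_shot(query, supports)
```

In this toy episode the "jump" support video contains the query's features in reversed order, yet it still receives the highest score, because the matching ignores feature order. Any temporal information would have to come from the features themselves, for example from a spatio-temporal backbone.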