CV | Jun 19, 2018

Modality Distillation with Multiple Stream Networks for Action Recognition

arXiv:1806.07110v2 | 206 citations | Has Code
AI Analysis

This paper addresses the challenge of deploying robust models in real-life scenarios where some modalities are missing. The contribution is incremental, since it builds on existing distillation frameworks.

The paper tackles multimodal video action recognition when only RGB data is available at test time. It proposes a hallucination network that distills depth features during training, and it reports state-of-the-art results on the NTU RGB+D dataset.

Diverse input data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset can be carefully designed to include a variety of sensory inputs, it is often the case that not all modalities are available in real-life (testing) scenarios, where a model has to be deployed. This raises the challenge of how to learn robust representations leveraging multimodal data in the training stage, while considering limitations at test time, such as noisy or missing modalities. This paper presents a new approach for multimodal video action recognition, developed within the unified frameworks of distillation and privileged information, named generalized distillation. In particular, we consider the case of learning representations from depth and RGB videos, while relying on RGB data only at test time. We propose a new approach to train a hallucination network that learns to distill depth features through multiplicative connections of spatiotemporal representations, leveraging soft labels and hard labels, as well as the distance between feature maps. We report state-of-the-art results on video action classification on the largest multimodal dataset available for this task, NTU RGB+D. Code available at https://github.com/ncgarcia/modality-distillation.
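
The abstract names three training signals for the hallucination network: hard labels, soft labels from the depth stream, and a distance between feature maps. The sketch below shows one plausible way such terms could be combined into a single loss. It is not the authors' implementation (see the linked repository for that). The function names, the temperature, and the weights `lam` and `alpha` are assumptions made for illustration, and the multiplicative connections mentioned in the abstract are not modeled here.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy H(target, predicted) between two distributions."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))

def feature_distance(student_feat, teacher_feat):
    """Mean squared distance between two flattened feature maps."""
    return sum((s - t) ** 2
               for s, t in zip(student_feat, teacher_feat)) / len(student_feat)

def hallucination_loss(student_logits, teacher_logits, hard_label,
                       student_feat, teacher_feat,
                       temperature=2.0, lam=0.5, alpha=1.0):
    """Hypothetical combined loss; temperature, lam and alpha are assumed values.

    student_*: outputs of the RGB-only hallucination stream.
    teacher_*: outputs of the depth stream, used only during training.
    """
    num_classes = len(student_logits)
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(num_classes)]
    # Hard-label term: ordinary classification loss on the ground truth.
    hard_term = cross_entropy(one_hot, softmax(student_logits))
    # Soft-label term: match the teacher's softened class distribution.
    soft_term = cross_entropy(softmax(teacher_logits, temperature),
                              softmax(student_logits, temperature))
    # Feature term: pull hallucinated features toward the depth features.
    feat_term = feature_distance(student_feat, teacher_feat)
    return (1.0 - lam) * hard_term + lam * soft_term + alpha * feat_term

# A student that agrees with the teacher incurs a smaller loss than one that does not.
close = hallucination_loss([2.0, 0.5, 0.1], [2.0, 0.5, 0.1], 0, [0.1, 0.2], [0.1, 0.2])
far = hallucination_loss([0.1, 0.5, 2.0], [2.0, 0.5, 0.1], 0, [0.9, 0.8], [0.1, 0.2])
print(close < far)
```

At test time only the RGB input is needed under this setup, because the teacher outputs enter the loss and nothing else.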

Code Implementations: 1 repo
