CV · MM · Apr 16, 2023

Robust Cross-Modal Knowledge Distillation for Unconstrained Videos

arXiv:2304.07775v2 · 8 citations · h-index: 52 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses cross-modal distillation for unconstrained videos. It is an incremental advance: it builds on existing synchronization-based methods and extends them to handle modality noise and semantic inconsistency.

The paper tackles cross-modal knowledge distillation in unconstrained videos, where irrelevant noise and inconsistent semantics hinder performance. It proposes two modules, a Modality Noise Filter and a Contrastive Semantic Calibration, to improve distillation. The authors report performance gains in visual action recognition and video retrieval.

Cross-modal distillation has been widely used to transfer knowledge across different modalities, enriching the representation of the target unimodal one. Recent studies closely tie the temporal synchronization between vision and sound to the semantic consistency needed for cross-modal distillation. However, such semantic consistency is hard to guarantee from synchronization alone in unconstrained videos, due to irrelevant modality noise and differentiated semantic correlation. To this end, we first propose a Modality Noise Filter (MNF) module to erase the irrelevant noise in the teacher modality using cross-modal context. After this purification, we design a Contrastive Semantic Calibration (CSC) module to adaptively distill useful knowledge for the target modality, by referring to the differentiated sample-wise semantic correlation in a contrastive fashion. Extensive experiments show that our method brings a performance boost over other distillation methods in both visual action recognition and video retrieval tasks. We also extend it to the audio tagging task to demonstrate the generalization of our method. The source code is available at https://github.com/GeWu-Lab/cross-modal-distillation.
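The idea of distilling "by referring to the differentiated sample-wise semantic correlation in a contrastive fashion" can be illustrated with a toy sketch. This is not the authors' implementation (the linked repository is the reference for that). It assumes that each sample's distillation term is weighted by how strongly its student feature matches its own teacher feature relative to the other teachers in the batch. The function name `calibrated_distillation_loss`, the `temperature` value, and the cosine-distance loss are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (leave all-zero vectors unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def calibrated_distillation_loss(student_feats, teacher_feats, temperature=0.1):
    """Toy sample-weighted distillation loss (illustrative assumption only).

    student_feats[i] and teacher_feats[i] come from the same video clip.
    Returns (loss, weights), where weights[i] is the contrastive probability
    that student i matches its own teacher among all teachers in the batch.
    """
    s = [l2_normalize(v) for v in student_feats]
    t = [l2_normalize(v) for v in teacher_feats]
    n = len(s)

    weights, per_sample = [], []
    for i in range(n):
        # Cosine similarity of student i to every teacher in the batch.
        logits = [dot(s[i], t[j]) / temperature for j in range(n)]
        # Weakly correlated pairs receive a small weight, so their
        # teacher signal contributes less to the final loss.
        weights.append(softmax(logits)[i])
        # Per-sample distillation term: cosine distance to the paired teacher.
        per_sample.append(1.0 - dot(s[i], t[i]))

    # In a real training loop the weights would be treated as constants
    # (no gradient); here everything is a plain float.
    loss = sum(w * l for w, l in zip(weights, per_sample)) / sum(weights)
    return loss, weights


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]   # student (visual) features
    audio = [[1.0, 0.0], [-1.0, 0.0]]  # teacher features; second one is off-topic
    loss, weights = calibrated_distillation_loss(video, audio)
    print("loss:", loss, "weights:", weights)
```

In this toy batch the second clip's audio is unrelated to its visual content, so it receives a smaller weight than the first clip and contributes less to the loss.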

Code Implementations: 1 repo
