CV · CL · Sep 27, 2023

VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning

arXiv:2309.15494v1 · 12 citations · h-index: 9
Originality: Incremental advance
AI Analysis

The paper addresses efficiency and robustness issues in multimodal AI systems for applications such as sentiment analysis and retrieval. The advance is incremental, since it builds on existing distillation and multimodal techniques.

The paper tackles two problems in multimodal transfer learning: missing modalities degrade performance, and extracting embeddings for every modality is inefficient at inference time. It proposes VideoAdviser, a video knowledge distillation method that transfers multimodal knowledge from a teacher model to a student model. The student, using only text input, improves MAE by up to 12.3% on the sentiment analysis datasets, and the method improves on the state of the art by 3.4% mAP on a retrieval dataset without extra inference computation.

Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion. However, conventional systems are typically built on the assumption that all modalities exist, and the lack of modalities always leads to poor inference performance. Furthermore, extracting pretrained embeddings for all modalities is computationally inefficient for inference. In this work, to achieve high efficiency-performance multimodal transfer learning, we propose VideoAdviser, a video knowledge distillation method to transfer multimodal knowledge of video-enhanced prompts from a multimodal fundamental model (teacher) to a specific modal fundamental model (student). With an intuition that the best learning performance comes with professional advisers and smart students, we use a CLIP-based teacher model to provide expressive multimodal knowledge supervision signals to a RoBERTa-based student model via optimizing a step-distillation objective loss -- first step: the teacher distills multimodal knowledge of video-enhanced prompts from classification logits to a regression logit -- second step: the multimodal knowledge is distilled from the regression logit of the teacher to the student. We evaluate our method in two challenging multimodal tasks: video-level sentiment analysis (MOSI and MOSEI datasets) and audio-visual retrieval (VEGAS dataset). The student (requiring only the text modality as input) achieves an MAE score improvement of up to 12.3% for MOSI and MOSEI. Our method further enhances the state-of-the-art method by 3.4% mAP score for VEGAS without additional computations for inference. These results suggest the strengths of our method for achieving high efficiency-performance multimodal transfer learning.
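
The two-step distillation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss formulas, so the squared-error terms, the softmax-weighted expected score used as the step-one target, the `class_values` bins, and the `alpha` weighting are all assumptions made for illustration. The function names are hypothetical as well.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(class_logits, class_values):
    """Collapse classification logits into one scalar score.

    ASSUMPTION: each class (e.g. one prompt per sentiment bin) maps to a
    numeric value, and the soft target is the probability-weighted mean.
    """
    probs = softmax(class_logits)
    return sum(p * v for p, v in zip(probs, class_values))


def step_distillation_loss(teacher_class_logits, class_values,
                           teacher_reg, student_reg, alpha=0.5):
    """Return (step1, step2, total) for one example.

    Step 1 (assumed form): pull the teacher's regression logit toward the
    score implied by its own classification logits.
    Step 2 (assumed form): pull the student's regression logit toward the
    teacher's regression logit.
    `alpha` is a hypothetical weighting between the two steps.
    """
    soft_target = expected_score(teacher_class_logits, class_values)
    step1 = (teacher_reg - soft_target) ** 2
    step2 = (student_reg - teacher_reg) ** 2
    total = alpha * step1 + (1.0 - alpha) * step2
    return step1, step2, total


if __name__ == "__main__":
    # Made-up numbers for illustration only; not taken from the paper.
    s1, s2, total = step_distillation_loss(
        teacher_class_logits=[0.0, 0.0, 0.0],
        class_values=[-1.0, 0.0, 1.0],
        teacher_reg=0.5,
        student_reg=1.0,
    )
    print(s1, s2, total)
```

In a setup like this, only the student's regression output would be needed at inference time, which is consistent with the abstract's claim that the student requires only the text modality and adds no inference cost.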

