CV | LG | Oct 14, 2024

Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

arXiv:2410.10663v2 | 1 citation | h-index: 9
AI Analysis

The paper addresses a limitation of unimodal few-shot learning, namely that real-world data are multi-modal, and offers an approach for improving generalization across modalities when labeled examples are scarce.

The paper tackles the problem of few-shot learning in multi-modal settings by introducing a Cross-modal Few-Shot Learning task and proposing a Generative Transfer Learning framework, which achieves state-of-the-art performance on seven multi-modal datasets.

Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize to unseen data using a limited number of labeled examples from a single modality. However, real-world data are inherently multi-modal, and such unimodal approaches limit the practical applications of few-shot learning. To bridge this gap, this paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances across multiple modalities while relying on scarce labeled data. Compared with classical few-shot learning, this task presents unique challenges that arise from the distinct visual attributes and structural disparities inherent to each modality. To tackle these challenges, we propose a Generative Transfer Learning (GTL) framework that simulates how humans abstract and generalize concepts. Specifically, GTL jointly estimates the latent concept shared across modalities and the in-modality disturbance through a generative structure. Establishing the relationship between latent concepts and visual content on abundant unimodal data enables GTL to effectively transfer knowledge from unimodal data to novel multi-modal data, as humans do. Comprehensive experiments demonstrate that GTL achieves state-of-the-art performance on seven multi-modal datasets spanning RGB-Sketch, RGB-Infrared, and RGB-Depth.
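The idea of separating a shared concept from an in-modality disturbance can be illustrated with a toy sketch. This is not the GTL framework itself, which the abstract describes as a learned generative structure; it is a deliberately simplified additive stand-in that assumes each feature vector equals a class concept plus a per-modality offset, and that every modality contains the same classes in equal numbers. The function names `fit` and `predict` and the two-dimensional data are hypothetical.

```python
from statistics import fmean


def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*vectors)]


def sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(support):
    """Estimate class concepts and modality offsets from a few labeled samples.

    support: list of (features, label, modality) tuples.
    Toy assumption (not from the paper): features = concept + offset + noise.
    """
    global_mean = mean_vec([x for x, _, _ in support])
    modalities = {m for _, _, m in support}
    # Modality offset: how far that modality's mean sits from the global mean.
    offsets = {
        m: sub(mean_vec([x for x, _, mm in support if mm == m]), global_mean)
        for m in modalities
    }
    labels = {y for _, y, _ in support}
    # Class concept: mean of the offset-corrected samples of that class.
    concepts = {
        y: mean_vec([sub(x, offsets[m]) for x, yy, m in support if yy == y])
        for y in labels
    }
    return concepts, offsets


def predict(x, modality, concepts, offsets):
    """Remove the modality offset, then pick the nearest class concept."""
    cleaned = sub(x, offsets[modality])
    return min(concepts, key=lambda y: sq_dist(cleaned, concepts[y]))


# One labeled example per class and modality; "sketch" features are shifted.
support = [
    ([1.0, 0.0], "cat", "rgb"),
    ([0.0, 1.0], "dog", "rgb"),
    ([6.0, 5.0], "cat", "sketch"),
    ([5.0, 6.0], "dog", "sketch"),
]
concepts, offsets = fit(support)
print(predict([5.8, 5.1], "sketch", concepts, offsets))
print(predict([0.1, 0.9], "rgb", concepts, offsets))
```

In this toy setting, removing the estimated offset maps RGB and sketch samples of the same class to the same point, so a single prototype per class serves both modalities. A learned generative model would replace the fixed additive assumption with an estimated one.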
