CV | AI | LG | SD | AS | Jan 16, 2023

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

CMU
arXiv:2301.06267v5 | 168 citations | h-index: 34
Originality: Incremental advance
AI Analysis

This work addresses the problem of limited data in few-shot learning for AI systems by leveraging multimodal data, offering a simple yet effective method that improves performance across vision and audio tasks.

The paper tackles few-shot learning by using cross-modal information from multimodal models to enhance visual classification, achieving state-of-the-art results with simple linear classifiers and creating a new audiovisual benchmark.

The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP learn cross-modal encoders that map different modalities to the same representation space. Specifically, we propose a simple strategy for cross-modal adaptation: we treat examples from different modalities as additional few-shot examples. For example, by simply repurposing class names as an additional training sample, we trivially turn any n-shot learning problem into an (n+1)-shot problem. This allows us to produce SOTA results with embarrassingly simple linear classifiers. We show that our approach can be combined with existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
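
The idea of treating class names as extra training samples can be illustrated with a small sketch. It is not the authors' code: the "image" and "text" features below are synthetic stand-ins for encoder outputs (the abstract names CLIP, but no real encoder is used here), and the helper names, noise levels, and optimizer settings are all assumptions chosen for illustration. The sketch only shows the data flow: one text feature per class is appended to the n image features per class, and the same linear classifier is trained on the combined set.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length, as is common for embedding features."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def train_linear_classifier(features, labels, num_classes, lr=0.5, steps=300):
    """Fit a softmax linear classifier with plain gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ weights
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        weights -= lr * features.T @ (probs - one_hot) / n
    return weights


def accuracy(weights, features, labels):
    return float(np.mean((features @ weights).argmax(axis=1) == labels))


rng = np.random.default_rng(0)
num_classes, dim, n_shot = 5, 32, 1

# Assumption: both modalities live in one shared space, modeled here as
# noisy copies of a per-class prototype vector (synthetic, not CLIP output).
prototypes = l2_normalize(rng.normal(size=(num_classes, dim)))


def sample_features(per_class, noise):
    labels = np.repeat(np.arange(num_classes), per_class)
    feats = prototypes[labels] + noise * rng.normal(size=(labels.size, dim))
    return l2_normalize(feats), labels


image_x, image_y = sample_features(n_shot, noise=0.3)  # n image shots per class
text_x, text_y = sample_features(1, noise=0.05)        # one "class name" per class
test_x, test_y = sample_features(50, noise=0.3)        # held-out image features

# Uni-modal baseline: train on the n image shots only.
w_uni = train_linear_classifier(image_x, image_y, num_classes)

# Cross-modal variant: append the text features as extra shots (n -> n+1).
cross_x = np.concatenate([image_x, text_x])
cross_y = np.concatenate([image_y, text_y])
w_cross = train_linear_classifier(cross_x, cross_y, num_classes)

acc_uni = accuracy(w_uni, test_x, test_y)
acc_cross = accuracy(w_cross, test_x, test_y)
print(f"image-only accuracy:  {acc_uni:.3f}")
print(f"cross-modal accuracy: {acc_cross:.3f}")
```

With real features, `image_x` and `text_x` would come from the image and text encoders of a multimodal model; the rest of the pipeline would stay a plain linear probe. Whether the cross-modal variant helps on a given dataset is an empirical question that this toy setup does not settle.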

Code Implementations: 1 repo
