CV · RO · Jul 6, 2023

Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

arXiv:2307.03073v3 · 25 citations · h-index: 24
Originality: Incremental advance
AI Analysis

This work addresses few-shot image classification with vision-language models, offering an incremental improvement by adapting existing prototypical networks to multimodal (image and text) embeddings.

The authors propose Proto-CLIP, a framework that uses CLIP's image and text encoders to compute and align image and text prototypes. They report improved few-shot classification performance on benchmark datasets and in robot perception experiments.

We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP which utilizes image prototypes and text prototypes for few-shot learning. Specifically, Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot examples. The embeddings from the two encoders are used to compute the respective prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of the corresponding classes. Such alignment is beneficial for few-shot classification due to the reinforced contributions from both types of prototypes. Proto-CLIP has both training-free and fine-tuned variants. We demonstrate the effectiveness of our method by conducting experiments on benchmark datasets for few-shot learning, as well as in the real world for robot perception. The project page is available at https://irvlutd.github.io/Proto-CLIP
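
The prototype-based classification described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the scoring rule, so the details here are assumptions: each class probability is a softmax over negative scaled squared distances to the prototypes, and the image-prototype and text-prototype probabilities are mixed with a weight `alpha`. The names `alpha`, `beta`, `classify`, and the toy three-dimensional vectors (standing in for real CLIP embeddings) are hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_prototype(embeddings):
    """Average a list of embeddings into a single class prototype."""
    n, dim = len(embeddings), len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def proto_probs(query, prototypes, beta):
    """Class probabilities from distances to one set of prototypes."""
    return softmax([-beta * sq_dist(query, p) for p in prototypes])

def classify(query, image_protos, text_protos, alpha=0.5, beta=10.0):
    """Mix image-prototype and text-prototype probabilities.

    alpha and beta are assumed hyperparameters, not values from the source.
    """
    q = l2_normalize(query)
    p_img = proto_probs(q, image_protos, beta)
    p_txt = proto_probs(q, text_protos, beta)
    return [alpha * pi + (1 - alpha) * pt for pi, pt in zip(p_img, p_txt)]

# Toy stand-ins for CLIP embeddings: two classes, two support images each.
support = {
    0: [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    1: [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
# One toy text embedding per class (e.g. from a class-name prompt).
text_embeddings = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0]}

image_protos = [
    mean_prototype([l2_normalize(e) for e in support[k]]) for k in sorted(support)
]
text_protos = [l2_normalize(text_embeddings[k]) for k in sorted(text_embeddings)]

probs = classify([0.8, 0.2, 0.0], image_protos, text_protos)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, probs)
```

In a real setting the toy vectors would be replaced by outputs of the CLIP image and text encoders. The sketch covers only inference with fixed prototypes and does not include the prototype alignment or fine-tuning mentioned in the abstract.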

Code Implementations: 1 repo
