CV · Mar 18, 2024

Compositional Kronecker Context Optimization for Vision-Language Models

arXiv:2403.11631v1 · 2 citations · h-index: 59 · Neurocomputing
Originality: Incremental advance
AI Analysis

This work addresses the problem of overfitting and limited generalization in prompt tuning for vision-language models, offering an incremental improvement for researchers and practitioners in computer vision and NLP.

The authors tackle the challenge of learning compact, generalizable context for adapting vision-language models to downstream tasks. They propose CK-CoOp, which reports state-of-the-art performance in base-to-new, domain, and cross-task generalization while using fewer learnable parameters and keeping training and inference efficient.

Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning compact context with satisfactory base-to-new, domain and cross-task generalization ability while adapting to new tasks is still a challenge. To tackle this challenge, we propose a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors, which are crafted by linearly combining base vectors sourced from a dictionary. These base vectors consist of a non-learnable component obtained by quantizing the weights in the token embedding layer, and a learnable component constructed by applying the Kronecker product to several learnable tiny matrices. Intuitively, the compositional structure mitigates the risk of overfitting on training data by retaining more pre-trained knowledge. Meanwhile, the Kronecker product breaks the non-learnable restrictions of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that CK-CoOp not only achieves state-of-the-art performance under base-to-new, domain and cross-task generalization evaluation, but also requires fewer learnable parameters and offers efficient training and inference speed.
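
The construction described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. All sizes, the random stand-in for the token embedding table, the use of k-means as the quantizer, and the choice to add the frozen and learnable components are assumptions made for illustration, and the names `kmeans_codebook` and `build_context` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not taken from the paper).
vocab_size, embed_dim = 1000, 64   # stand-in token embedding table
dict_size = 16                     # number of base vectors in the dictionary
n_ctx = 4                          # number of context words in the prompt

# Random stand-in for the frozen, pre-trained token embedding weights.
token_embedding = rng.normal(size=(vocab_size, embed_dim))


def kmeans_codebook(weights, k, iters=5, seed=0):
    """Assumed quantizer: a few k-means steps yielding k centroid vectors."""
    local_rng = np.random.default_rng(seed)
    centers = weights[local_rng.choice(len(weights), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every embedding row to every centroid.
        dists = ((weights[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = weights[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers


def build_context(coeffs, frozen_dict, factors):
    """Context words = coeffs @ (frozen dictionary + Kronecker product of factors)."""
    learnable = factors[0]
    for factor in factors[1:]:
        learnable = np.kron(learnable, factor)
    # Assumption: the two components of each base vector are summed.
    return coeffs @ (frozen_dict + learnable)


# Non-learnable component: quantized token embeddings, shape (dict_size, embed_dim).
frozen_dict = kmeans_codebook(token_embedding, dict_size)

# Learnable component: two tiny matrices whose Kronecker product has shape
# (4 * 4, 8 * 8) = (dict_size, embed_dim).
A = 0.01 * rng.normal(size=(4, 8))
B = 0.01 * rng.normal(size=(4, 8))

# Learnable combination coefficients, one row per context word.
coeffs = rng.normal(size=(n_ctx, dict_size))

context = build_context(coeffs, frozen_dict, [A, B])

n_learnable = coeffs.size + A.size + B.size
n_full_dict = dict_size * embed_dim
print("context shape:", context.shape)
print("learnable parameters in this sketch:", n_learnable)
print("parameters of a fully learnable dictionary:", n_full_dict)
```

In this toy configuration the two Kronecker factors hold 64 values in total, whereas a directly learnable dictionary of the same shape would hold 1024, which illustrates why the abstract describes the learnable component as adding minimal parameters. In an actual training setup the coefficients and factors would be optimized by gradient descent while the quantized dictionary stays fixed.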

