CV, AI, LG | Jun 11, 2024

Understanding Visual Concepts Across Models

arXiv:2406.07506v1
Originality: Incremental advance
AI Analysis

The paper points to a limitation in visual concept learning for multimodal AI systems: word embeddings fine-tuned for one model do not transfer to other models.

The paper investigates whether different multimodal models learn similar word embeddings for new visual concepts after fine-tuning. Across three state-of-the-art models, it finds that the learned embeddings are model-specific and non-transferable, and that perturbations within an ε-ball of any prior embedding are enough to generate, detect, and classify an arbitrary concept.

Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after fine-tuning just a single word embedding. Do models learn similar words for the same concepts (i.e. <orange-cat> = orange + cat)? We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that new word embeddings are model-specific and non-transferable. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an $\epsilon$-ball to any prior embedding that generate, detect, and classify an arbitrary concept. When these new embeddings are spliced into new models, fine-tuning that targets the original model is lost. We show popular soft prompt-tuning approaches find these perturbative solutions when applied to visual concept learning tasks, and embeddings for visual concepts are not transferable. Code for reproducing our work is available at: https://visual-words.github.io.
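To make the phrase "perturbations within an $\epsilon$-ball" concrete, the sketch below shows one common way to keep a learned embedding within distance ε of a prior embedding: project it back onto the ball after each update. This is an illustration only, not the authors' code. The abstract does not state which norm is used, so the L2 norm here is an assumption, and the names `project_to_eps_ball` and `l2_distance` are hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def project_to_eps_ball(candidate, anchor, eps):
    """Return the point nearest to `candidate` that lies within L2 distance
    `eps` of `anchor`. The L2 norm is an assumption made for illustration."""
    if len(candidate) != len(anchor):
        raise ValueError("candidate and anchor must have the same length")
    if eps < 0:
        raise ValueError("eps must be non-negative")
    delta = [c - a for c, a in zip(candidate, anchor)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        # Already inside the ball: leave the embedding unchanged.
        return list(candidate)
    # Outside the ball: shrink the offset so its length is exactly eps.
    scale = eps / norm
    return [a + scale * d for a, d in zip(anchor, delta)]


# Toy 3-dimensional example; real word embeddings have hundreds of dimensions.
prior = [1.0, 0.0, 0.0]
updated = [4.0, 4.0, 0.0]
projected = project_to_eps_ball(updated, prior, eps=0.5)
print(projected, l2_distance(projected, prior))
```

In a fine-tuning loop, a projection like this would typically follow each gradient step on the new word embedding, so that the learned vector never moves more than ε away from the prior embedding it started from.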

Code Implementations: 1 repo
