CV | Dec 6, 2023

Language-Informed Visual Concept Learning

Stanford
arXiv:2312.03587v2 | 13 citations | h-index: 76 | ICLR
AI Analysis

This work addresses the challenge of bridging language and visual perception for applications in image generation and editing, though it is incremental as it builds on existing pre-trained models.

The paper learns visual concept representations that are informed by language yet capture visual nuances beyond what language can articulate. It distills large pre-trained vision-language models into encoders for concept axes such as color or style, which enables image generation with novel concept compositions and, through lightweight finetuning, generalization to unseen concepts.

Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.
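The training signal and the remixing step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The linear encoders, the `toy_decoder` standing in for the frozen T2I model, the `anchor_weight` value, the vector sizes, and the `category` axis are all assumptions made for illustration. Only the overall shape comes from the abstract: one encoder per concept axis, a reconstruction term, an anchoring term toward text embeddings, and remixing of per-axis embeddings at inference.

```python
import random

random.seed(0)

# "color" and "style" are named in the abstract; "category" is a hypothetical extra axis.
AXES = ["category", "color", "style"]
FEAT_DIM, EMB_DIM = 6, 4  # toy sizes, chosen arbitrarily


def make_encoder():
    """Toy linear encoder: a FEAT_DIM x EMB_DIM weight matrix."""
    return [[random.gauss(0.0, 0.1) for _ in range(EMB_DIM)]
            for _ in range(FEAT_DIM)]


# One encoder per concept axis.
encoders = {axis: make_encoder() for axis in AXES}


def encode(image_feat):
    """Map one image feature vector to one embedding per concept axis."""
    return {
        axis: [sum(image_feat[i] * W[i][j] for i in range(FEAT_DIM))
               for j in range(EMB_DIM)]
        for axis, W in encoders.items()
    }


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def toy_decoder(embs):
    """Stand-in for the frozen T2I model: concatenates the axis embeddings."""
    return [x for axis in AXES for x in embs[axis]]


def total_loss(image_feat, target, anchors, anchor_weight=0.1):
    """Reconstruction term plus a term pulling each embedding toward its text anchor."""
    embs = encode(image_feat)
    recon = mse(toy_decoder(embs), target)
    anchor = sum(mse(embs[a], anchors[a]) for a in AXES) / len(AXES)
    return recon + anchor_weight * anchor


def remix(embs_a, embs_b, axis):
    """Keep every concept from image A except `axis`, which is taken from image B."""
    mixed = dict(embs_a)  # shallow copy so embs_a is left untouched
    mixed[axis] = embs_b[axis]
    return mixed
```

For example, `remix(encode(img_a), encode(img_b), "color")` yields a set of embeddings that keeps image A's category and style but takes image B's color. In the actual method, such a mixed set would condition the T2I model to generate a new image.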

Code Implementations: 1 repo