Bridging the gap to real-world language-grounded visual concept learning
This work aims to make visual concept learning in AI systems more flexible and varied, though the contribution appears incremental because it builds on pretrained vision-language models.
The paper tackles a limitation of existing language-grounded visual concept learning methods, which are restricted to predefined axes such as color and shape in synthetic datasets. It proposes a scalable framework that adaptively identifies and grounds diverse visual concept axes in real-world scenes, and it reports superior editing capabilities and strong compositional generalization on subsets of ImageNet, CelebA-HQ, and AFHQ.
Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color and shape, and are typically explored in synthetic datasets. In this work, we propose a scalable framework that adaptively identifies image-related concept axes and grounds visual concepts along these axes in real-world scenes. Leveraging a pretrained vision-language model and our universal prompting strategy, our framework identifies diverse image-related axes without any prior knowledge. Our universal concept encoder adaptively binds visual features to the discovered axes without introducing additional model parameters for each concept. To ground visual concepts along the discovered axes, we optimize a compositional anchoring objective, which ensures that each axis can be independently manipulated without affecting others. We demonstrate the effectiveness of our framework on subsets of ImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities across diverse real-world concepts that are too varied to be manually predefined. Our method also exhibits strong compositional generalization, outperforming existing visual concept learning and text-based editing methods. The code is available at https://github.com/whieya/Language-grounded-VCL.
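
The property that each axis can be manipulated independently can be illustrated with a toy sketch. It is not the authors' implementation: the names `ConceptCode`, `swap_axis`, and `changed_axes`, the axis labels, and the two-dimensional embeddings are all hypothetical, and the sketch says nothing about how the concept encoder or the compositional anchoring objective is actually built. It only assumes that an image is represented by one embedding per discovered axis, so that an edit amounts to replacing the embedding on a single axis.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


@dataclass(frozen=True)
class ConceptCode:
    """Hypothetical per-image representation: one embedding per concept axis."""
    axes: Dict[str, Vector]


def swap_axis(target: ConceptCode, source: ConceptCode, axis: str) -> ConceptCode:
    """Return a copy of `target` whose `axis` embedding is taken from `source`."""
    if axis not in target.axes or axis not in source.axes:
        raise KeyError(f"axis {axis!r} must be present in both codes")
    edited = dict(target.axes)              # shallow copy; other axes are reused as-is
    edited[axis] = list(source.axes[axis])  # replace only the requested axis
    return ConceptCode(edited)


def changed_axes(before: ConceptCode, after: ConceptCode) -> List[str]:
    """List the axes whose embeddings differ between two codes."""
    return sorted(k for k in before.axes if before.axes[k] != after.axes.get(k))


# Toy values standing in for encoder outputs (assumed, not from the paper).
cat = ConceptCode({"species": [1.0, 0.0], "fur color": [0.2, 0.8], "pose": [0.5, 0.5]})
dog = ConceptCode({"species": [0.0, 1.0], "fur color": [0.9, 0.1], "pose": [0.3, 0.7]})

edited = swap_axis(cat, dog, "fur color")
print(changed_axes(cat, edited))  # ['fur color']
```

In this toy setting the swap leaves every other axis untouched by construction. In a learned model that independence is not automatic, which is the role the abstract assigns to the compositional anchoring objective.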