CV · Oct 17, 2024

Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts?

arXiv:2410.13651v1 · h-index: 30
Originality: Incremental advance
AI Analysis

This work addresses visual concept identification for AI systems with an explainable, zero-shot approach that mimics human reasoning. The advance is incremental, since it builds on existing LLM and VQA technologies.

The paper tackles zero-shot fine-grained visual concept learning. It uses a large language model (GPT-3) to generate linguistic descriptions of objects, converts those descriptions into binary questions, and poses the questions to a Visual Question Answering (VQA) system to identify objects in images. The reported performance is comparable to existing zero-shot and few-shot methods, and each decision remains explainable.

The ability to learn about new objects from a small amount of visual data, and to produce a convincing linguistic justification for the presence or absence of certain concepts (that collectively compose the object) in novel scenarios, is an important characteristic of human cognition. This is possible due to the abstraction of the attributes/properties that an object is composed of; e.g., an object 'bird' can be identified by the presence of a beak, feathers, legs, wings, etc. Inspired by this aspect of human reasoning, in this work we present a zero-shot framework for fine-grained visual concept learning that leverages a large language model and a Visual Question Answering (VQA) system. Specifically, we prompt GPT-3 to obtain a rich linguistic description of the visual objects in the dataset. We convert the obtained concept descriptions into a set of binary questions. We pose these questions, along with the query image, to a VQA system and aggregate the answers to determine the presence or absence of an object in the test images. Our experiments demonstrate performance comparable to existing zero-shot visual classification methods and few-shot concept learning approaches, without substantial computational overhead, while remaining fully explainable from the reasoning perspective.
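
The pipeline described in the abstract (concept description, binary questions, VQA answers, aggregation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the attribute list stands in for GPT-3 output, `toy_vqa` is a hypothetical placeholder for a real VQA model, and the "fraction of yes answers above a threshold" rule is an assumed aggregation, since the abstract does not specify one.

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for an LLM-generated description of each concept.
# The 'bird' attributes are the ones named in the abstract.
CONCEPT_ATTRIBUTES: Dict[str, List[str]] = {
    "bird": ["a beak", "feathers", "legs", "wings"],
}


def to_binary_questions(attributes: List[str]) -> List[str]:
    """Turn each attribute phrase into a yes/no question."""
    return [f"Does the object in the image have {attr}?" for attr in attributes]


def identify(image, concept: str, vqa: Callable[[object, str], bool],
             threshold: float = 0.5) -> dict:
    """Ask every binary question and aggregate the answers.

    Assumed aggregation: the concept counts as present when the
    fraction of 'yes' answers reaches `threshold`.
    """
    questions = to_binary_questions(CONCEPT_ATTRIBUTES[concept])
    answers = {q: vqa(image, q) for q in questions}
    score = sum(answers.values()) / len(answers) if answers else 0.0
    # The per-question answers double as the explanation for the decision.
    return {"present": score >= threshold, "score": score, "answers": answers}


def toy_vqa(image: Set[str], question: str) -> bool:
    """Placeholder VQA: an 'image' is just the set of attributes visible in it."""
    return any(attr in question for attr in image)


if __name__ == "__main__":
    sparrow_like = {"a beak", "feathers", "wings"}
    result = identify(sparrow_like, "bird", toy_vqa)
    print(result["present"], result["score"])
    for question, answer in result["answers"].items():
        print(f"  {question} -> {'yes' if answer else 'no'}")
```

Because the decision is computed from the individual yes/no answers, the same dictionary that yields the score also serves as the justification, which is the sense in which such a pipeline is explainable. Swapping `toy_vqa` for a real model only requires a callable with the same `(image, question) -> bool` shape.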

Code Implementations: 1 repo
