Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts?
This work addresses visual concept identification for AI systems with an explainable, zero-shot approach that mimics human reasoning, though it is incremental in that it builds on existing LLM and VQA technologies.
The paper tackles zero-shot fine-grained visual concept learning by using a large language model (GPT-3) to generate linguistic descriptions of objects and converting them into binary questions for a Visual Question Answering (VQA) system, whose answers are used to identify objects in images. The approach achieves performance comparable to existing zero-shot and few-shot methods while remaining explainable.
The ability to learn about new objects from a small amount of visual data, and to produce a convincing linguistic justification for the presence or absence of the concepts that collectively compose an object in novel scenarios, is an important characteristic of human cognition. This is possible because an object can be abstracted into the attributes or properties that compose it; for example, a 'bird' can be identified by the presence of a beak, feathers, legs, wings, etc. Inspired by this aspect of human reasoning, we present a zero-shot framework for fine-grained visual concept learning that leverages a large language model and a Visual Question Answering (VQA) system. Specifically, we prompt GPT-3 to obtain a rich linguistic description of the visual objects in the dataset. We convert the obtained concept descriptions into a set of binary questions. We pose these questions, along with the query image, to a VQA system and aggregate the answers to determine the presence or absence of an object in the test images. Our experiments demonstrate performance comparable to existing zero-shot visual classification methods and few-shot concept learning approaches, without substantial computational overhead, while remaining fully explainable from a reasoning perspective.
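The pipeline described above (descriptions, then binary questions, then aggregated VQA answers) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hand-written attribute lists stand in for GPT-3 output, `toy_vqa` is a hypothetical stand-in for a real VQA model, the question template is invented for illustration, and the aggregation rule (fraction of "yes" answers, then argmax over classes) is one plausible choice that the abstract does not specify.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical VQA interface: takes an image and a question, returns "yes" or "no".
VQA = Callable[[Any, str], str]

# Assumption: in the real system these attribute lists would come from prompting
# an LLM for a description of each class; here they are hand-written.
CLASS_ATTRIBUTES: Dict[str, List[str]] = {
    "bird": ["a beak", "feathers", "wings"],
    "fish": ["fins", "scales", "gills"],
}


def attributes_to_questions(attributes: List[str]) -> List[str]:
    """Turn each attribute phrase into a binary (yes/no) question."""
    return [f"Does the object in the image have {a}?" for a in attributes]


def score_class(image: Any, questions: List[str], vqa: VQA) -> float:
    """Aggregate VQA answers as the fraction of questions answered 'yes'."""
    if not questions:
        return 0.0
    yes = sum(1 for q in questions if vqa(image, q).strip().lower() == "yes")
    return yes / len(questions)


def classify(
    image: Any, class_attributes: Dict[str, List[str]], vqa: VQA
) -> Tuple[str, Dict[str, float]]:
    """Score every class and return the best one together with all scores."""
    scores = {
        name: score_class(image, attributes_to_questions(attrs), vqa)
        for name, attrs in class_attributes.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


def toy_vqa(image: Any, question: str) -> str:
    """Toy stand-in for a VQA model: the 'image' is a set of visible attributes."""
    return "yes" if any(attr in question for attr in image) else "no"


# Usage: a toy "image" that shows a beak, feathers, and wings.
best, scores = classify({"a beak", "feathers", "wings"}, CLASS_ATTRIBUTES, toy_vqa)
```

With the toy image above, `best` is `"bird"`, with a score of 1.0 for bird and 0.0 for fish. Because each score is built from individual yes/no answers, the per-question answers can be reported directly as the explanation for a prediction, which is the sense in which such a pipeline is explainable.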