ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
ZeroCap provides a flexible, zero-shot method for image captioning and high-level vision tasks that addresses limitations of supervised methods, though the contribution is incremental in that it repurposes existing models.
The authors tackle the problem of generating descriptive text from images without any training by combining a visual-semantic model with a large language model. This enables zero-shot image-to-text generation and novel capabilities such as image arithmetic and visual analogy tests.
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate descriptive text given an image at inference time, without any further training or tuning steps. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text, and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests. Our code is available at: https://github.com/YoadTew/zero-shot-image-to-text.
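The image arithmetic described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes images and text share one embedding space, uses hand-made 3-dimensional vectors in place of real model embeddings, and substitutes retrieval over a fixed list of candidate sentences for the language-model generation step. All names and values below are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def embedding_arithmetic(base, minus, plus):
    """Compute base - minus + plus on unit-normalized embeddings."""
    base, minus, plus = normalize(base), normalize(minus), normalize(plus)
    return [b - m + p for b, m, p in zip(base, minus, plus)]

def best_caption(query, candidates):
    """Return the candidate sentence whose embedding best matches the query."""
    return max(candidates, key=lambda s: cosine(query, candidates[s]))

# Hypothetical toy embeddings; dimensions stand for (royalty, male, female).
# A real system would obtain these from a visual-semantic model instead.
IMAGE_KING = [1.0, 1.0, 0.0]   # stands in for an image input
TEXT_MAN = [0.0, 1.0, 0.0]     # stands in for a text input
TEXT_WOMAN = [0.0, 0.0, 1.0]   # stands in for a text input

CANDIDATES = {
    "a king on a throne": [1.0, 1.0, 0.0],
    "a queen on a throne": [1.0, 0.0, 1.0],
    "a woman walking": [0.0, 0.0, 1.0],
}

# image(king) - text(man) + text(woman): inputs mix images and text,
# and the output is a sentence.
query = embedding_arithmetic(IMAGE_KING, TEXT_MAN, TEXT_WOMAN)
result = best_caption(query, CANDIDATES)
print(result)  # a queen on a throne
```

In this sketch the output sentence is chosen from a closed list, whereas the abstract describes producing the sentence with a large language model; the arithmetic on the embeddings is the part being illustrated.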