CapText: Large Language Model-based Caption Generation From Image Context and Description
This work addresses the challenge of producing captions that complement an image rather than merely describe it, which matters for applications in accessibility and content creation. The contribution is incremental in that it builds on existing large language models.
The paper tackles context-dependent image captioning by proposing a method that uses large language models to generate captions from textual descriptions and context, without directly processing images. After fine-tuning, the method outperforms state-of-the-art models like OSCAR-VinVL on the CIDEr metric.
While deep-learning models have been shown to perform well on image-to-text datasets, it is difficult to use them in practice for captioning images. This is because captions traditionally tend to be context-dependent and offer complementary information about an image, while models tend to produce descriptions of the image's visual features. Prior research in caption generation has explored models that generate captions when provided with the images alongside their respective descriptions or contexts. We propose and evaluate a new approach, which leverages existing large language models to generate captions from textual descriptions and context alone, without ever processing the image directly. We demonstrate that, after fine-tuning, our approach outperforms current state-of-the-art image-text alignment models like OSCAR-VinVL on this task on the CIDEr metric.
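To make the text-only setup concrete, here is a minimal sketch of how a description and its context might be combined into a single input for a language model. The function name `build_caption_prompt` and the `Description:` / `Context:` / `Caption:` field labels are hypothetical choices made for illustration. They are not the template used in the paper, and the fine-tuned model itself is not shown.

```python
def build_caption_prompt(description: str, context: str) -> str:
    """Combine an image description and its surrounding context into one text input.

    Hypothetical format: the field labels are illustrative assumptions,
    not the paper's actual input template.
    """
    # Collapse runs of whitespace so each field occupies a single line.
    description = " ".join(description.split())
    context = " ".join(context.split())
    # The trailing "Caption:" marks where a language model would continue.
    return f"Description: {description}\nContext: {context}\nCaption:"


prompt = build_caption_prompt(
    "A red double-decker bus on a wet street.",
    "Article section about public transport strikes in London.",
)
print(prompt)
```

The resulting string contains no image data, so under this assumption any text-only language model could be fine-tuned to continue it with a caption.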