CVCLDBJul 3, 2024

BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs

arXiv:2407.03314v24 citationsh-index: 12
AI Analysis

This addresses the issue of dense captions hindering multi-modal models for researchers and practitioners, offering an incremental improvement through structured prompting.

The paper tackles the problem of unclear and complex image captions generated by large Vision-Language Models by proposing BACON, a method that structures captions into disentangled elements like objects and relationships, which improved caption quality and enabled models like GroundingDINO to achieve 1.51x higher recall scores in object detection tasks.

Advancements in large Vision-Language Models have brought precise, accurate image captioning, vital for advancing multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined contexts that are difficult to parse and frequently overlook essential cues, posing a great barrier for models like GroundingDINO and SDXL, which lack the strong text encoding and syntax analysis needed to fully leverage dense captions. To address this, we propose BACON, a prompting method that breaks down VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion from handling complex contexts but also allows for efficient transfer into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information. We annotated 100,000 image-caption pairs using BACON with GPT-4V and trained an LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V. Evaluations of overall quality, precision, and recall-as well as user studies-demonstrate that the resulting caption model consistently outperforms other SOTA VLM models in generating high-quality captions. Besides, we show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing SOTA solutions without training. For example, BACON-style captions help GroundingDINO achieve 1.51x higher recall scores on open-vocabulary object detection tasks compared to leading methods.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes