CV · AI · Dec 23, 2025

Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval

arXiv:2512.20042v1
Originality: Incremental advance
AI Analysis

The work addresses the need for richer image descriptions in domains such as journalism and education, though it appears incremental because it builds on existing methods.

The paper tackles the lack of contextual depth in image captions by proposing a multimodal pipeline that augments visual input with external textual knowledge. Evaluated on the OpenEvents v1 dataset, the pipeline produces significantly more informative captions than traditional methods.

Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEiT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A fine-tuned Qwen3 model with QLoRA then integrates this context with base captions generated by InstructBLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions than traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding.
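
The retrieve-then-rerank stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors stand in for BEiT-3 or SigLIP embeddings, and the `match_counts` dictionary stands in for the number of ORB/SIFT keypoint matches between the query and each candidate. The function names, the blending `weight`, and the linear score combination are all assumptions chosen for illustration; the abstract does not say how the two signals are combined.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, k):
    """Stage 1: return the top-k (similarity, image_id) pairs by cosine similarity."""
    scored = [(cosine(query_vec, vec), image_id) for image_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def rerank(candidates, match_counts, weight=0.5):
    """Stage 2: blend embedding similarity with a normalized geometric match count.

    `match_counts` is a hypothetical precomputed mapping image_id -> number of
    keypoint matches (a stand-in for ORB/SIFT matching); `weight` is an assumed
    blending factor, not a value from the paper.
    """
    max_matches = max((match_counts.get(i, 0) for _, i in candidates), default=0) or 1
    blended = []
    for sim, image_id in candidates:
        geo = match_counts.get(image_id, 0) / max_matches
        blended.append(((1 - weight) * sim + weight * geo, image_id))
    blended.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in blended]


# Toy data: three indexed images with 3-dimensional placeholder embeddings.
index = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
match_counts = {"img_a": 5, "img_b": 40, "img_c": 80}
query = [1.0, 0.0, 0.0]

candidates = retrieve(query, index, k=2)   # img_a and img_b survive stage 1
ranked = rerank(candidates, match_counts)  # img_b overtakes img_a after reranking
print(ranked)
```

In this toy case `img_c` has the most keypoint matches but never reaches the reranker, because it is filtered out by the embedding stage. That illustrates why the candidate pool size `k` matters in a two-stage design.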

