CV | AI | Sep 19, 2025

RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning

arXiv:2509.15883v1 | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses fine-grained relation modeling in image captioning for applications that require lightweight models, though it appears incremental because it builds on existing retrieval-augmented methods.

The paper tackles relation modeling in retrieval-augmented image captioning by proposing RACap, which mines structured relation semantics from retrieved captions and identifies heterogeneous objects in the image. With only 10.8M trainable parameters, it reports better performance than previous lightweight captioning models.

Recent retrieval-augmented image captioning methods incorporate external knowledge to compensate for the limitations in comprehending complex scenes. However, current approaches face challenges in relation modeling: (1) the representation of semantic prompts is too coarse-grained to capture fine-grained relationships; (2) these methods lack explicit modeling of image objects and their semantic relationships. To address these limitations, we propose RACap, a relation-aware retrieval-augmented model for image captioning, which not only mines structured relation semantics from retrieval captions, but also identifies heterogeneous objects from the image. RACap effectively retrieves structured relation features that contain heterogeneous visual information to enhance the semantic consistency and relational expressiveness. Experimental results show that RACap, with only 10.8M trainable parameters, achieves superior performance compared to previous lightweight captioning models.
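
To make the idea of a relation-aware prompt concrete, here is a minimal sketch of the general pattern: retrieve the captions nearest to an image embedding, pull a (subject, relation, object) triple out of each, and assemble a text prompt. It is not RACap's implementation. The toy three-dimensional embeddings, the `RELATIONS` vocabulary, the string-matching extractor, and the prompt layout are all assumptions made for illustration, and the image-side object identification described in the abstract is not covered.

```python
import math

# Hypothetical relation vocabulary; the source does not specify one.
RELATIONS = ("sitting on", "standing on", "next to", "holding", "riding")


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(image_vec, datastore, k=2):
    """Return the k captions whose embeddings are most similar to image_vec."""
    ranked = sorted(datastore, key=lambda item: cosine(image_vec, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]


def extract_triple(caption):
    """Split a caption around the first known relation phrase, if any.

    Naive string matching stands in for real structured relation mining.
    """
    text = caption.lower().rstrip(".")
    for rel in RELATIONS:
        marker = f" {rel} "
        if marker in text:
            subject, obj = text.split(marker, 1)
            return (subject.strip(), rel, obj.strip())
    return None


def build_prompt(image_vec, datastore, k=2):
    """Assemble a relation-aware prompt from the retrieved captions."""
    triples = [t for t in map(extract_triple, retrieve(image_vec, datastore, k)) if t]
    lines = [f"({s}; {r}; {o})" for s, r, o in triples]
    return "Relations: " + " ".join(lines) + "\nCaption:"


# Toy datastore of (caption, embedding) pairs; the vectors are made up.
datastore = [
    ("A man riding a horse.", [0.9, 0.1, 0.0]),
    ("A cat sitting on a sofa.", [0.0, 1.0, 0.1]),
    ("A dog next to a bicycle.", [0.1, 0.0, 1.0]),
]
prompt = build_prompt([0.1, 0.9, 0.2], datastore, k=1)
print(prompt)
```

In a real system the embeddings would come from a pretrained image-text encoder and the triples from a proper parser, and the prompt would be passed to a language decoder. The sketch only shows how structured triples can replace raw retrieved captions as the prompt content.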
