CV, AI | Sep 19, 2025

RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning

Xiaosheng Long, Hanyu Wang, Zhentao Song, Kun Luo, Hongde Liu

arXiv:2509.15883v1 | h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses fine-grained relation modeling in image captioning for applications that require lightweight models. It appears incremental, since it builds on existing retrieval-augmented methods.

The paper tackles relation modeling in retrieval-augmented image captioning by proposing RACap, which mines structured relation semantics from retrieved captions and identifies heterogeneous objects in the image. With only 10.8M trainable parameters, it reports better performance than previous lightweight captioning models.

Recent retrieval-augmented image captioning methods incorporate external knowledge to compensate for the limitations in comprehending complex scenes. However, current approaches face challenges in relation modeling: (1) the representation of semantic prompts is too coarse-grained to capture fine-grained relationships; (2) these methods lack explicit modeling of image objects and their semantic relationships. To address these limitations, we propose RACap, a relation-aware retrieval-augmented model for image captioning, which not only mines structured relation semantics from retrieval captions, but also identifies heterogeneous objects from the image. RACap effectively retrieves structured relation features that contain heterogeneous visual information to enhance the semantic consistency and relational expressiveness. Experimental results show that RACap, with only 10.8M trainable parameters, achieves superior performance compared to previous lightweight captioning models.
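To make the idea of a relation-aware prompt concrete, the sketch below shows one way retrieved captions could be turned into structured (subject, relation, object) triples and combined with detected objects. It is not the authors' implementation: the abstract does not specify how relations are mined or how objects are identified. The keyword list `RELATIONS`, the toy two-dimensional embeddings, the helper names, and the prompt template are all hypothetical, and simple keyword matching stands in for whatever relation extractor RACap actually uses.

```python
import math

# Hypothetical toy relation vocabulary; a real system would use a parser
# or a learned extractor rather than keyword matching.
RELATIONS = ("sitting on", "standing on", "riding", "holding", "next to")
ARTICLES = {"a", "an", "the"}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, datastore, k=2):
    """Return the k captions whose embeddings are closest to query_vec.

    datastore is a list of (caption, embedding) pairs (assumed format).
    """
    ranked = sorted(datastore, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [caption for caption, _ in ranked[:k]]


def _strip(phrase):
    """Drop articles, the copula 'is', and trailing punctuation."""
    words = [w for w in phrase.strip(" .").split()
             if w not in ARTICLES and w != "is"]
    return " ".join(words)


def extract_triple(caption):
    """Return (subject, relation, object), or None if no relation matches."""
    text = caption.lower()
    for rel in RELATIONS:
        marker = f" {rel} "
        if marker in text:
            subj, obj = text.split(marker, 1)
            return (_strip(subj), rel, _strip(obj))
    return None


def build_prompt(captions, objects):
    """Combine detected objects and mined relations into a text prompt."""
    triples = [t for t in map(extract_triple, captions) if t]
    rel_part = "; ".join(f"{s} {r} {o}" for s, r, o in triples)
    obj_part = ", ".join(objects)
    return f"Objects: {obj_part}. Relations: {rel_part}. Caption:"


# Toy datastore with made-up 2-D embeddings.
datastore = [
    ("A man is riding a horse.", [1.0, 0.0]),
    ("A cat sitting on the sofa", [0.0, 1.0]),
    ("A dog holding a ball", [0.9, 0.1]),
]
retrieved = retrieve([1.0, 0.0], datastore, k=2)
prompt = build_prompt(retrieved, ["man", "horse"])
```

In this toy setup the prompt carries explicit relation triples rather than raw retrieved captions, which is the kind of finer-grained semantic prompt the abstract argues for. In a lightweight captioner, such a prompt would typically be fed to a frozen language decoder alongside the image features.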
