CV, AI, CL | Feb 8, 2024

CIC: A Framework for Culturally-Aware Image Captioning

arXiv:2402.05374v5 | 12 citations | h-index: 5 | IJCAI
Originality: Incremental advance
AI Analysis

This paper addresses the lack of cultural detail in image captioning, which matters for users who need culturally descriptive outputs. The advance is incremental, since it builds on existing vision-language models.

The paper tackles the problem of generating image captions that describe cultural elements, such as traditional clothing. It proposes the CIC framework, which combines visual question answering with large language models to produce culturally-aware captions. In a human evaluation, participants judged CIC's captions to be more culturally descriptive than those of a VLP-based captioning baseline.

Image Captioning generates descriptive sentences from images using Vision-Language Pre-trained models (VLPs) such as BLIP, which has improved greatly. However, current methods lack the generation of detailed descriptive captions for the cultural elements depicted in the images, such as the traditional clothing worn by people from Asian cultural groups. In this paper, we propose a new framework, Culturally-aware Image Captioning (CIC), that generates captions and describes cultural elements extracted from cultural visual elements in images representing cultures. Inspired by methods combining visual modality and Large Language Models (LLMs) through appropriate prompts, our framework (1) generates questions based on cultural categories from images, (2) extracts cultural visual elements from Visual Question Answering (VQA) using generated questions, and (3) generates culturally-aware captions using LLMs with the prompts. Our human evaluation conducted on 45 participants from 4 different cultural groups with a high understanding of the corresponding culture shows that our proposed framework generates more culturally descriptive captions when compared to the image captioning baseline based on VLPs. Resources can be found at https://shane3606.github.io/cic.
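The three steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the category list, the question template, the prompt wording, and every function name here are hypothetical, and the VQA model and LLM are passed in as plain callables so that no particular library API is implied.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical categories for illustration; the abstract does not enumerate them.
CULTURAL_CATEGORIES = ["clothing", "food", "architecture"]

VQAModel = Callable[[Any, str], str]  # (image, question) -> answer text
LLM = Callable[[str], str]            # prompt -> generated text


def generate_questions(categories: List[str]) -> Dict[str, str]:
    """Step 1: one question per cultural category (the template is an assumption)."""
    return {c: f"What kind of traditional {c} is shown in the image?" for c in categories}


def extract_cultural_elements(
    image: Any, questions: Dict[str, str], vqa: VQAModel
) -> Dict[str, str]:
    """Step 2: ask the VQA model each question and keep only non-empty answers."""
    elements: Dict[str, str] = {}
    for category, question in questions.items():
        answer = vqa(image, question).strip()
        if answer:
            elements[category] = answer
    return elements


def build_prompt(elements: Dict[str, str]) -> str:
    """Turn the extracted elements into an LLM prompt (the wording is an assumption)."""
    lines = [f"- {category}: {answer}" for category, answer in elements.items()]
    return (
        "Write one image caption that describes these cultural elements:\n"
        + "\n".join(lines)
    )


def culturally_aware_caption(
    image: Any, vqa: VQAModel, llm: LLM, categories: Optional[List[str]] = None
) -> str:
    """Step 3: prompt the LLM with the extracted elements to produce the caption."""
    questions = generate_questions(categories or CULTURAL_CATEGORIES)
    elements = extract_cultural_elements(image, questions, vqa)
    return llm(build_prompt(elements))
```

In practice `vqa` would wrap a vision-language model such as BLIP and `llm` would wrap a text-generation model; keeping both as callables leaves the choice of models open, which the abstract does not pin down beyond naming BLIP as an example VLP.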

