CV · Mar 24

Caption Generation for Dongba Paintings via Prompt Learning and Semantic Fusion

Shuangwu Qian, Xiaochan Yuan, Pengfei Liu

arXiv:2603.22946 · 14.8 · h-index: 1

Predicted impact top 55% in CV · last 90 days · Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of generating accurate textual descriptions for culturally rich Dongba paintings. The contribution is incremental: it adapts existing captioning methods to a new domain.

The paper tackles automatic captioning for Dongba paintings, a culturally specific art form, by proposing PVGF-DPC, an encoder-decoder framework that integrates a content prompt module and a visual semantic-generation fusion loss, achieving improved performance on a new dataset of 9,408 images with culturally grounded annotations.

Dongba paintings, the treasured pictorial legacy of the Naxi people in southwestern China, feature richly layered visual elements, vivid color palettes, and pronounced ethnic and regional cultural symbolism, yet their automatic textual description remains largely unexplored owing to severe domain shift when mainstream captioning models are applied directly. This paper proposes PVGF-DPC (Prompt and Visual Semantic-Generation Fusion-based Dongba Painting Captioning), an encoder-decoder framework that integrates a content prompt module with a novel visual semantic-generation fusion loss to bridge the gap between generic natural-image captioning and the culturally specific imagery found in Dongba art. A MobileNetV2 encoder extracts discriminative visual features, which are injected into the layer normalization of a 10-layer Transformer decoder initialized with pretrained BERT weights; meanwhile, the content prompt module maps the image feature vector to culture-aware labels (such as "deity", "ritual pattern", or "hell ghost") and constructs a post-prompt that steers the decoder toward thematically accurate descriptions. The visual semantic-generation fusion loss jointly optimizes the cross-entropy objectives of both the prompt predictor and the caption generator, encouraging the model to extract key cultural and visual cues and to produce captions that are semantically aligned with the input image. We construct a dedicated Dongba painting captioning dataset comprising 9,408 augmented images with culturally grounded annotations spanning seven thematic categories.
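
The abstract says the fusion loss jointly optimizes two cross-entropy objectives, one for the prompt predictor and one for the caption generator, but it does not give the exact formula. The following is a minimal sketch of one plausible form, assuming a simple weighted sum. The function names, the weight `alpha`, and the averaging over caption tokens are all hypothetical choices made for illustration, not details taken from the paper.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class index."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def fusion_loss(prompt_logits, prompt_label, token_logits, token_ids, alpha=1.0):
    """Assumed joint objective: mean caption CE + alpha * prompt-label CE.

    prompt_logits: scores over the thematic labels for one image
    prompt_label:  index of the ground-truth label
    token_logits:  one list of vocabulary scores per caption position
    token_ids:     ground-truth token index per caption position
    alpha:         hypothetical weight balancing the two terms
    """
    if not token_ids or len(token_logits) != len(token_ids):
        raise ValueError("token_logits and token_ids must be non-empty and aligned")
    prompt_ce = softmax_cross_entropy(prompt_logits, prompt_label)
    caption_ce = sum(
        softmax_cross_entropy(scores, tok)
        for scores, tok in zip(token_logits, token_ids)
    ) / len(token_ids)
    return caption_ce + alpha * prompt_ce


# Toy inputs: seven thematic categories, a 4-word vocabulary, a 3-token caption.
toy_prompt_logits = [2.0, 0.1, -1.0, 0.0, 0.3, -0.5, 0.2]
toy_token_logits = [
    [1.5, 0.0, -0.5, 0.2],
    [0.1, 2.0, 0.0, -1.0],
    [0.0, 0.0, 0.0, 3.0],
]
toy_loss = fusion_loss(toy_prompt_logits, 0, toy_token_logits, [0, 1, 3], alpha=0.5)
```

Sharing one scalar loss means gradients from both the label prediction and the caption tokens flow back into the shared image features, which is the intuition behind "jointly optimizes" in the abstract. In practice the two terms would be computed with a deep learning framework over batches rather than with plain Python lists.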
