CVAILGMar 26, 2024

Semi-Supervised Image Captioning Considering Wasserstein Graph Matching

arXiv:2403.17995v1
Originality Incremental advance
AI Analysis

This addresses the challenge of reducing annotation costs for image captioning in real-world applications, though it is an incremental improvement over existing semi-supervised methods.

The paper tackles the problem of semi-supervised image captioning, where limited labeled image-text pairs and many unlabeled images exist, by proposing SSIC-WGM, which uses scene graphs and Wasserstein distance to align visual and textual features, achieving improved performance on benchmark datasets like COCO with BLEU-4 scores up to 36.2.

Image captioning can automatically generate captions for the given images, and the key challenge is to learn a mapping function from visual features to natural language features. Existing approaches are mostly supervised ones, i.e., each image has a corresponding sentence in the training set. However, considering that describing images always requires a huge of manpower, we usually have limited amount of described images (i.e., image-text pairs) and a large number of undescribed images in real-world applications. Thereby, a dilemma is the "Semi-Supervised Image Captioning". To solve this problem, we propose a novel Semi-Supervised Image Captioning method considering Wasserstein Graph Matching (SSIC-WGM), which turns to adopt the raw image inputs to supervise the generated sentences. Different from traditional single modal semi-supervised methods, the difficulty of semi-supervised cross-modal learning lies in constructing intermediately comparable information among heterogeneous modalities. In this paper, SSIC-WGM adopts the successful scene graphs as intermediate information, and constrains the generated sentences from two aspects: 1) inter-modal consistency. SSIC-WGM constructs the scene graphs of the raw image and generated sentence respectively, then employs the wasserstein distance to better measure the similarity between region embeddings of different graphs. 2) intra-modal consistency. SSIC-WGM takes the data augmentation techniques for the raw images, then constrains the consistency among augmented images and generated sentences. Consequently, SSIC-WGM combines the cross-modal pseudo supervision and structure invariant measure for efficiently using the undescribed images, and learns more reasonable mapping function.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes