CV · Jan 19

Dual-Stream Collaborative Transformer for Image Captioning

arXiv:2601.12926v11.5

Originality: Incremental advance

AI Analysis

This paper aims to generate more accurate image captions, though the advance appears incremental because it builds on existing region-feature methods.

The paper tackles the problem of irrelevant descriptions in image captioning by proposing a Dual-Stream Collaborative Transformer that fuses region and segmentation features. The authors report that it outperforms state-of-the-art models on benchmark datasets.

Current region-feature-based image captioning methods have progressed rapidly and achieved remarkable performance. However, they are still prone to generating irrelevant descriptions due to the lack of contextual information and the over-reliance on generated partial descriptions for predicting the remaining words. In this paper, we propose a Dual-Stream Collaborative Transformer (DSCT) to address this issue by introducing the segmentation feature. The proposed DSCT consolidates and then fuses the region and segmentation features to guide the generation of caption sentences. It contains multiple Pattern-Specific Mutual Attention Encoders (PSMAEs) and Dynamic Nomination Decoders (DNDs). The PSMAE effectively highlights and consolidates the private information of the two representations by having them query each other. The DND dynamically searches for the learning blocks most relevant to the input textual representations and exploits the homogeneous features between the consolidated region and segmentation features to generate more accurate and descriptive caption sentences. To the best of our knowledge, this is the first study to explore how to fuse different pattern-specific features in a dynamic way to bypass their semantic inconsistencies and spatial misalignment issues for image captioning. Experimental results on popular benchmark datasets demonstrate that our DSCT outperforms the state-of-the-art image captioning models in the literature.
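As a rough illustration of the "querying each other" idea described for the PSMAE, the sketch below lets a set of region features attend over segmentation features and vice versa, using plain scaled dot-product attention. This is not the paper's encoder: the abstract gives no layer details, so the absence of learned projections, the residual addition, the feature sizes, and the names `cross_attention` and `mutual_attention` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product attention where each query row attends over
    the rows of keys_values (used as both keys and values here)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ keys_values, weights

def mutual_attention(region, segmentation):
    """Hypothetical mutual-attention step: each stream queries the other,
    and the attended context is added back as a residual (an assumption)."""
    region_ctx, _ = cross_attention(region, segmentation)
    seg_ctx, _ = cross_attention(segmentation, region)
    return region + region_ctx, segmentation + seg_ctx

# Toy inputs: 5 region features and 7 segmentation features, each of size 8.
rng = np.random.default_rng(0)
region = rng.normal(size=(5, 8))
segmentation = rng.normal(size=(7, 8))

region_out, seg_out = mutual_attention(region, segmentation)
print(region_out.shape, seg_out.shape)
```

Each stream keeps its own length and feature size, so the two outputs can still be processed separately before any later fusion step. A real encoder would normally add learned query, key, and value projections, multiple heads, and normalization.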
