CV · AI · CL · Sep 10, 2023

Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning

arXiv:2309.04965v285 citations · h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating diverse and efficient image captions for real-world applications, representing an incremental improvement by combining existing techniques in a novel way.

The paper tackles the limited diversity and large parameter scale in image captioning systems by proposing Prefix-diffusion, a lightweight network that injects prefix image embeddings into a diffusion model to generate diverse captions with fewer parameters, achieving promising performance compared to recent approaches.

While impressive performance has been achieved in image captioning, the limited diversity of the generated captions and the large parameter scale remain major barriers to the real-world application of these systems. In this work, we propose a lightweight image captioning network in combination with continuous diffusion, called Prefix-diffusion. To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model. To reduce the number of trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network. Prefix-diffusion is able to generate diverse captions with relatively few parameters, while maintaining the fluency and relevance of the captions thanks to the generative capabilities of the diffusion model. Our work paves the way for scaling up diffusion models for image captioning, and achieves promising performance compared with recent approaches.
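The abstract does not spell out how the prefix is injected, so the following is only a minimal sketch of one plausible reading: a frozen image feature is passed through a small mapping network to produce a few prefix vectors, which are concatenated in front of the noised caption embeddings before denoising. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the single linear layer standing in for the mapping network, the dimensions, the DDPM-style linear noise schedule, and the choice of concatenation as the injection mechanism.

```python
import math
import random


def linear_schedule_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed, DDPM-style)."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def map_image_to_prefix(image_feature, weights, prefix_len, dim):
    """Hypothetical mapping network: one linear layer, reshaped into prefix_len vectors."""
    flat = [sum(w * f for w, f in zip(row, image_feature)) for row in weights]
    return [flat[i * dim:(i + 1) * dim] for i in range(prefix_len)]


def add_noise(x0, alpha_bar_t, rng):
    """Forward diffusion: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, per coordinate."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [[a * v + b * rng.gauss(0.0, 1.0) for v in tok] for tok in x0]


def build_denoiser_input(prefix, noisy_caption):
    """Concatenate the clean prefix with the noisy caption; only caption positions carry noise."""
    return [list(p) for p in prefix] + [list(t) for t in noisy_caption]


# Toy sizes and random stand-ins for the frozen image feature and caption embeddings.
rng = random.Random(0)
feat_dim, dim, prefix_len, cap_len = 8, 4, 2, 5
image_feature = [rng.gauss(0, 1) for _ in range(feat_dim)]
weights = [[rng.gauss(0, 0.1) for _ in range(feat_dim)] for _ in range(prefix_len * dim)]
caption_x0 = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(cap_len)]

alpha_bars = linear_schedule_alpha_bar(1000)
prefix = map_image_to_prefix(image_feature, weights, prefix_len, dim)
x_t = add_noise(caption_x0, alpha_bars[500], rng)
denoiser_input = build_denoiser_input(prefix, x_t)
print(len(denoiser_input), len(denoiser_input[0]))
```

In this reading, only the mapping weights (and the denoiser, omitted here) would be trained, while the image encoder stays frozen, which is consistent with the abstract's emphasis on reducing trainable parameters. Sampling different noise for the same prefix is what would yield different captions for one image.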
