A-CAP: Anticipation Captioning with Commonsense Knowledge
This work addresses the challenge of future reasoning in vision-language tasks for AI systems, but it is incremental: it builds on existing models and datasets.
The paper tackles the problem of generating a caption for an unseen image from sparse, temporally ordered visual cues, a task it introduces as Anticipation Captioning. It proposes A-CAP, which incorporates commonsense knowledge into a pre-trained vision-language model and outperforms other image captioning methods on a customized visual storytelling dataset.
Humans possess the capacity to reason about the future based on a sparse collection of visual cues acquired over time. In order to emulate this ability, we introduce a novel task called Anticipation Captioning, which generates a caption for an unseen oracle image using a sparse, temporally ordered set of images. To tackle this new task, we propose a model called A-CAP, which incorporates commonsense knowledge into a pre-trained vision-language model, allowing it to anticipate the caption. Through both qualitative and quantitative evaluations on a customized visual storytelling dataset, A-CAP outperforms other image captioning methods and establishes a strong baseline for anticipation captioning. We also address the challenges inherent in this task.
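
The input/output shape of the task described above can be illustrated with a minimal sketch. This is not the authors' data pipeline: the abstract does not say how the visual storytelling dataset was customized, so the split rule used here (observe the first few images, hold out the last one as the oracle) is an assumption, and the names AnticipationSample and make_anticipation_sample, the image identifiers, and the captions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class AnticipationSample:
    """One hypothetical example: observed images in, oracle caption out."""
    observed_images: Tuple[str, ...]  # image identifiers in temporal order
    oracle_caption: str               # caption of the held-out, unseen image


def make_anticipation_sample(
    story: Sequence[Tuple[str, str]], num_observed: int
) -> AnticipationSample:
    """Split a temporally ordered (image_id, caption) story.

    Assumption: the first `num_observed` images are the model input, and the
    final image is the unseen oracle, of which only the caption is kept.
    """
    if not 1 <= num_observed < len(story):
        raise ValueError("need at least one observed and one held-out image")
    observed = tuple(image_id for image_id, _ in story[:num_observed])
    _, oracle_caption = story[-1]
    return AnticipationSample(observed, oracle_caption)


# Made-up story, for illustration only.
story = [
    ("img_0", "A family packs the car."),
    ("img_1", "They drive along the coast."),
    ("img_2", "They unload surfboards at the beach."),
    ("img_3", "Everyone is surfing in the waves."),
]
sample = make_anticipation_sample(story, num_observed=3)
```

Under this assumed split, a captioning model would see only sample.observed_images and be scored against sample.oracle_caption; the oracle image itself never reaches the model.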