CL, CV | Oct 10, 2022

Generating image captions with external encyclopedic knowledge

arXiv:2210.04806v1 | 2 citations | h-index: 27
Originality: Incremental advance
AI Analysis

This work aims to generate more humanlike, contextualized image captions. It is an incremental advance that builds on existing captioning methods.

The paper incorporates contextual and encyclopedic knowledge into image caption generation. It uses the image's location to retrieve relevant facts from an external knowledge base and integrates them at both the encoding and decoding stages. On a new dataset of knowledge-rich captions, the system achieves significant improvements over multiple baselines.

Accurately reporting what objects are depicted in an image is largely a solved problem in automatic caption generation. The next big challenge on the way to truly humanlike captioning is being able to incorporate the context of the image and related real-world knowledge. We tackle this challenge by creating an end-to-end caption generation system that makes extensive use of image-specific encyclopedic data. Our approach includes a novel way of using image location to identify relevant open-domain facts in an external knowledge base, with their subsequent integration into the captioning pipeline at both the encoding and decoding stages. Our system is trained and tested on a new dataset with naturally produced knowledge-rich captions, and achieves significant improvements over multiple baselines. We empirically demonstrate that our approach is effective for generating contextualized captions with encyclopedic knowledge that is both factually accurate and relevant to the image.
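
The abstract does not specify how image location is matched against the knowledge base, so the following is only a minimal sketch of one plausible retrieval step: rank knowledge-base facts by great-circle distance from the image's coordinates and keep the nearest ones. The `Fact` class, the `retrieve_nearby_facts` function, the 5 km radius, and the tiny in-memory knowledge base are all hypothetical illustrations, not the authors' method or data.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Fact:
    """Hypothetical knowledge-base entry: a geolocated entity and one fact about it."""
    entity: str
    text: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def retrieve_nearby_facts(image_lat, image_lon, knowledge_base, radius_km=5.0, top_k=3):
    """Return up to top_k facts within radius_km of the image, nearest first.

    Assumption: proximity alone decides relevance; a real system would
    likely combine it with other signals.
    """
    scored = [(haversine_km(image_lat, image_lon, f.lat, f.lon), f) for f in knowledge_base]
    nearby = [(d, f) for d, f in scored if d <= radius_km]
    nearby.sort(key=lambda pair: pair[0])
    return [f for _, f in nearby[:top_k]]


# Toy knowledge base (illustrative entries only).
KB = [
    Fact("Eiffel Tower", "The Eiffel Tower was completed in 1889.", 48.8584, 2.2945),
    Fact("Louvre", "The Louvre is an art museum in Paris.", 48.8606, 2.3376),
    Fact("Big Ben", "Big Ben is the nickname of the Great Bell at the Palace of Westminster.", 51.5007, -0.1246),
]

# An image taken a few dozen metres from the Eiffel Tower.
facts = retrieve_nearby_facts(48.8580, 2.2950, KB)
retrieved_entities = [f.entity for f in facts]
```

In a full captioning pipeline, the retrieved fact texts would then be passed to the model alongside the image features; how the paper does this at the encoding and decoding stages is not described in the abstract.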
