CV · May 24, 2023

Exploring Diverse In-Context Configurations for Image Captioning

arXiv:2305.14800v6 · 95 citations · Has Code
Originality: Incremental advance
AI Analysis

This work aims to improve few-shot learning in vision-language models, which is relevant to both researchers and practitioners. It is an incremental advance because it builds on existing in-context learning methods.

The paper tackles the problem of optimizing in-context configurations for vision-language tasks, specifically image captioning, by devising strategies for image selection and caption assignment. The best strategy combinations improve CIDEr by 20.9 points on average over the baseline.

Since the discovery that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in Vision-Language (VL) domains have also developed their own few-shot learners, but they use only the simplest method, i.e., random sampling, to configure in-context image-text pairs. To explore the effects of varying configurations on VL in-context learning, we devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Image captioning is used as the case study because it can be seen as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning, which arise from multi-modal synergy, as compared to the NLP case. Furthermore, in our exploration of optimal combination strategies, we observed an average improvement of 20.9 CIDEr points over the baseline. The code is available at https://github.com/yongliang-wu/ExploreCfg.
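To make the idea of an in-context configuration concrete, here is a minimal sketch of two ways to pick the image-text pairs that precede a query image. Only random sampling is named in the abstract; the similarity-based selector is an assumed alternative for illustration and is not claimed to be one of the paper's four strategies. The `Example` tuple, the toy embeddings, and the `<image:...>` prompt tokens are all hypothetical and do not reflect the ExploreCfg code.

```python
import math
import random
from typing import List, Sequence, Tuple

# A candidate in-context example: (image_id, caption, image_embedding).
# Embeddings are hypothetical stand-ins for features from any image encoder.
Example = Tuple[str, str, Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_random(pool: List[Example], k: int, seed: int = 0) -> List[Example]:
    """Baseline named in the abstract: sample k pairs uniformly at random."""
    return random.Random(seed).sample(pool, k)


def select_similar(pool: List[Example], query_emb: Sequence[float], k: int) -> List[Example]:
    """Assumed alternative: keep the k pairs whose images are closest to the query."""
    ranked = sorted(pool, key=lambda ex: cosine(ex[2], query_emb), reverse=True)
    return ranked[:k]


def build_prompt(shots: List[Example], query_image_id: str) -> str:
    """Interleave image placeholders and captions, ending with the open query slot."""
    parts = [f"<image:{img}> Caption: {cap}" for img, cap, _ in shots]
    parts.append(f"<image:{query_image_id}> Caption:")
    return "\n".join(parts)


# Toy pool with made-up 2-D embeddings.
pool: List[Example] = [
    ("img_a", "a dog runs on grass", [1.0, 0.0]),
    ("img_b", "a cat sleeps on a sofa", [0.0, 1.0]),
    ("img_c", "a puppy plays in a park", [0.9, 0.1]),
]
query_emb = [1.0, 0.05]

shots = select_similar(pool, query_emb, k=2)
prompt = build_prompt(shots, "img_query")
print(prompt)
```

Swapping `select_similar` for `select_random` changes only which pairs appear in the prompt, which is the kind of variation the study compares. Caption assignment (which caption accompanies each selected image) would be a separate, independent choice in such a setup.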

Code Implementations: 1 repo
