CV · AI · CL · Dec 28, 2023

MIVC: Multiple Instance Visual Component for Visual-Language Models

arXiv:2312.17109v1 · 4 citations · h-index: 3 · WACV
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of handling a varying number of images per entity in vision-language models for e-commerce applications. The contribution is an incremental improvement.

The paper tackles the problem of consolidating entity understanding from multiple images and aligning it with pre-trained language models for generative tasks. It proposes MIVC to aggregate visual representations and reports consistent performance improvements on visual question answering, classification, and captioning on an e-commerce dataset.

Vision-language models have been explored across a wide range of tasks and achieve satisfactory performance. However, how to consolidate entity understanding from a varying number of images and align it with pre-trained language models for generative tasks remains under-explored. In this paper, we propose MIVC, a general multiple instance visual component that bridges the gap between varying image inputs and off-the-shelf vision-language models by aggregating visual representations in a permutation-invariant fashion through a neural network. We show that MIVC can be plugged into vision-language models to consistently improve performance on visual question answering, classification, and captioning tasks on a publicly available e-commerce dataset with multiple images per product. Furthermore, we show that the component provides insight into the contribution of each image to the downstream tasks.
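
To make the idea of permutation-invariant aggregation concrete, here is a minimal sketch using attention-based multiple instance pooling. This is one common way to build such a component, not necessarily the architecture used in the paper. The function name `attention_mil_pool`, the parameters `V` and `w`, and all dimensions are hypothetical, and random arrays stand in for per-image features from a visual encoder.

```python
import numpy as np

def attention_mil_pool(feats, V, w):
    """Pool a variable number of image features into one vector.

    feats: (n_images, d) per-image features (assumed to come from a visual encoder)
    V:     (d, h) projection matrix (hypothetical learned parameter)
    w:     (h,)   scoring vector (hypothetical learned parameter)
    Returns the pooled (d,) vector and the (n_images,) attention weights.
    """
    # Score each image independently of the others.
    scores = np.tanh(feats @ V) @ w
    # Softmax over images, shifted by the max for numerical stability.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # A weighted sum over images does not depend on image order.
    pooled = weights @ feats
    return pooled, weights

rng = np.random.default_rng(0)
d, h = 8, 4
V = rng.normal(size=(d, h))
w = rng.normal(size=h)

# Five images of one product, then the same images in a shuffled order.
feats = rng.normal(size=(5, d))
perm = rng.permutation(5)
pooled, weights = attention_mil_pool(feats, V, w)
pooled_perm, weights_perm = attention_mil_pool(feats[perm], V, w)

# The same parameters also handle a product with only three images.
pooled_small, weights_small = attention_mil_pool(feats[:3], V, w)
```

In this sketch the pooled vector is identical for any ordering of the images, and the attention weights give a per-image score that can be read as that image's contribution, in the spirit of the insight the abstract describes.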
