CL AI Oct 15, 2023

VLIS: Unimodal Language Models Guide Multimodal Language Generation

arXiv:2310.09767v2131 citations h-index: 7
Originality: Incremental advance
AI Analysis

VLIS addresses a key bottleneck in multimodal AI for applications that require nuanced language processing, though it is an incremental advance that builds on existing models.

The paper tackles the weak grasp of complex language in vision-language models by introducing VLIS, a framework that combines the visual conditioning of a vision-language model with a unimodal text-only language model, without further training. It improves performance on commonsense understanding and text generation tasks.

Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training. It extracts pointwise mutual information of each image and text from a visual-language model and uses the value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.
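
The reweighting described in the abstract can be illustrated with a minimal sketch. It assumes per-token logits are available from a text-only model and from a vision-language model run twice, once with the image and once without; how the image-free likelihood is obtained is an assumption here, as are the function names and the exponent `alpha`, none of which come from the abstract. The sketch computes a token-level pointwise mutual information and adds it, in log space, to the text-only log-likelihood.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def vlis_scores(text_lm_logits, vlm_logits_with_image,
                vlm_logits_without_image, alpha=1.0):
    """Return renormalised next-token log-probabilities.

    Hypothetical helper, not the authors' code. `alpha` is an
    illustrative weight on the visual term (an assumption).
    """
    log_p_text = log_softmax(text_lm_logits)
    log_p_cond = log_softmax(vlm_logits_with_image)
    log_p_marg = log_softmax(vlm_logits_without_image)

    scores = []
    for lt, lc, lm in zip(log_p_text, log_p_cond, log_p_marg):
        # Pointwise mutual information between the image and this token.
        pmi = lc - lm
        # Importance weight exp(pmi) applied to the text-only likelihood,
        # written in log space.
        scores.append(lt + alpha * pmi)
    return log_softmax(scores)


# Toy three-token vocabulary: the text-only model prefers token 0,
# while the image makes token 1 much more likely under the VLM.
text_logits = [2.0, 1.0, 0.0]
with_image = [0.0, 3.0, 0.0]
without_image = [1.0, 1.0, 1.0]

fused = vlis_scores(text_logits, with_image, without_image)
best = max(range(len(fused)), key=lambda i: fused[i])
print(best)
```

In this toy case the visual evidence shifts the preferred token from 0 to 1. Setting `alpha=0.0` ignores the image and recovers the text-only distribution.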

Code Implementations: 1 repo