CV, Dec 14, 2023

Pixel Aligned Language Models

arXiv:2312.09237v1, 19 citations, h-index: 24, CVPR
Originality: Highly original
AI Analysis

This work addresses the challenge of integrating precise spatial localization into vision-language models for tasks like object captioning and grounding, representing a novel extension rather than an incremental improvement.

The authors tackled the problem of enabling large language models to handle localization tasks in vision-language settings, such as word grounding and referring localization, by developing a model that processes and generates pixel coordinates, achieving state-of-the-art performance on RefCOCO and Visual Genome benchmarks.

Large language models have achieved great success in recent years, as have their variants in vision. Existing vision-language models can describe images in natural language, answer visual questions, or perform complex reasoning about the image. However, it is still unclear how localization tasks, such as word grounding or referring localization, can be performed using large language models. In this work, we aim to develop a vision-language model that can take locations, for example a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, generating captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word generated by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narratives dataset, which contains pixel-word-aligned captions derived from human attention. We show that our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM
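To make the idea of regressing a pixel coordinate for each output word concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the two-layer head, the feature size, the random weights, and every function name below are assumptions chosen for illustration. The only part taken from the abstract is the interface, in which each generated word gets one (x, y) location.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Random weights for a hypothetical linear layer (stand-in for trained parameters)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, layer):
    """Apply y = Wx + b to a single feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def regress_point(token_feature, hidden_layer, out_layer, width, height):
    """Map one word's feature vector to a pixel coordinate (x, y)."""
    hidden = [max(0.0, v) for v in linear(token_feature, hidden_layer)]  # ReLU
    # Sigmoid keeps the normalized coordinates inside (0, 1).
    u, v = (1.0 / (1.0 + math.exp(-z)) for z in linear(hidden, out_layer))
    return u * width, v * height


def ground_caption(words, token_features, hidden_layer, out_layer, width, height):
    """Pair every word of a caption with its regressed pixel location."""
    return [
        (word, regress_point(feature, hidden_layer, out_layer, width, height))
        for word, feature in zip(words, token_features)
    ]


# Toy inputs: in a real model the features would come from the language model.
rng = random.Random(0)
dim = 8  # assumed feature size, for illustration only
hidden_layer = make_linear(dim, 16, rng)
out_layer = make_linear(16, 2, rng)
words = ["a", "dog", "on", "grass"]
token_features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in words]

grounded = ground_caption(words, token_features, hidden_layer, out_layer, 640, 480)
for word, (x, y) in grounded:
    print(f"{word}: ({x:.1f}, {y:.1f})")
```

With untrained weights the printed points carry no meaning. The sketch only shows the shape of the output: one pixel location per word, bounded by the image size.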

