CL, CV | Jun 28, 2023

Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

arXiv:2306.16410v1 | 69 citations | h-index: 23 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the integration of vision and language for researchers and practitioners. It offers a flexible, open-source tool; the advance is incremental because it builds on existing LLMs rather than training a new multimodal model.

The authors propose LENS, a modular system in which a large language model (LLM) reasons over the outputs of descriptive vision modules. It achieves competitive performance on zero- and few-shot object recognition and on vision-language tasks without any multimodal training.

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
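The modular idea in the abstract can be illustrated with a minimal sketch: vision modules emit text, and a frozen LLM answers from that text alone. This is not the authors' implementation or the API of the linked repository. The module categories (tags, attributes, captions), the prompt layout, and every name below (`ImageDescription`, `build_prompt`, `answer`, `toy_llm`) are hypothetical, and the LLM is replaced by a stub so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ImageDescription:
    """Text-only outputs of hypothetical vision modules for one image."""
    tags: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    captions: List[str] = field(default_factory=list)


def build_prompt(desc: ImageDescription, question: str) -> str:
    """Serialize the module outputs and the question into one text prompt."""
    lines = [
        "Tags: " + ", ".join(desc.tags),
        "Attributes: " + ", ".join(desc.attributes),
        "Captions: " + " | ".join(desc.captions),
        "Question: " + question,
        "Short answer:",
    ]
    return "\n".join(lines)


def answer(desc: ImageDescription, question: str,
           llm: Callable[[str], str]) -> str:
    """Ask a text-only LLM to reason over the descriptions (no image input)."""
    return llm(build_prompt(desc, question)).strip()


def toy_llm(prompt: str) -> str:
    """Stand-in for a real frozen LLM (assumption): a trivial keyword rule."""
    return " dog " if "dog" in prompt else " unknown "


if __name__ == "__main__":
    desc = ImageDescription(
        tags=["dog", "grass"],
        attributes=["brown fur"],
        captions=["a dog running on grass"],
    )
    print(build_prompt(desc, "What animal is shown?"))
    print(answer(desc, "What animal is shown?", toy_llm))
```

Because the language model only ever sees text, any off-the-shelf LLM could in principle be swapped in for `toy_llm` by wrapping it in a function that maps a prompt string to a completion string.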

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
