CL | CV | Oct 29, 2024

Are VLMs Really Blind

arXiv:2410.22029v1 | h-index: 4
Originality: Incremental advance
AI Analysis

The paper addresses a critical limitation of VLMs in applications that require geometric reasoning, though the advance is incremental because it builds on existing VQA methods.

The paper tackles the poor performance of Vision Language Models on low-level basic visual tasks. It finds that an automatic pipeline, which extracts key information from images through question-derived captions, enables precise answers without fine-tuning.

Vision Language Models excel in handling a wide range of complex tasks, including Optical Character Recognition (OCR), Visual Question Answering (VQA), and advanced geometric reasoning. However, these models fail to perform well on low-level basic visual tasks which are especially easy for humans. Our goal in this work was to determine if these models are truly "blind" to geometric reasoning or if there are ways to enhance their capabilities in this area. Our work presents a novel automatic pipeline designed to extract key information from images in response to specific questions. Instead of just relying on direct VQA, we use question-derived keywords to create a caption that highlights important details in the image related to the question. This caption is then used by a language model to provide a precise answer to the question without requiring external fine-tuning.
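The pipeline described in the abstract (question, then keywords, then a keyword-focused caption, then a language-model answer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stopword-based keyword extraction, the prompt wording, and the names `extract_keywords`, `build_caption_prompt`, and `answer_question` are all assumptions, and the captioner and language model are passed in as plain callables so that any model could be plugged in.

```python
import re
from typing import Any, Callable, List

# Hypothetical stopword list: the abstract does not say how keywords are derived.
STOPWORDS = {
    "a", "an", "the", "is", "are", "how", "many", "do", "does", "in", "of",
    "this", "image", "there", "what", "which", "and", "or", "to",
}


def extract_keywords(question: str) -> List[str]:
    """Return the question's content words, in order, without duplicates."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    keywords: List[str] = []
    for token in tokens:
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def build_caption_prompt(keywords: List[str]) -> str:
    """Build a captioning instruction that focuses on the keywords (assumed wording)."""
    return "Describe the image, focusing on: " + ", ".join(keywords) + "."


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any, str], str],
    language_model: Callable[[str], str],
) -> str:
    """Caption the image with a question-derived prompt, then answer from the caption."""
    keywords = extract_keywords(question)
    caption = captioner(image, build_caption_prompt(keywords))
    qa_prompt = f"Caption: {caption}\nQuestion: {question}\nAnswer:"
    return language_model(qa_prompt)
```

In this sketch the language model never sees the image; it answers from the caption alone, which matches the abstract's description of replacing direct VQA with a caption-then-answer step. Neither callable is fine-tuned here.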

Code Implementations: 1 repo
