CL, CV | Oct 29, 2024

Are VLMs Really Blind

arXiv:2410.22029v11.0 | h-index: 4 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a critical limitation of VLMs in applications that require geometric reasoning, though the advance is incremental because it builds on existing VQA methods.

The paper tackles the poor performance of Vision Language Models on low-level, basic visual tasks. It finds that an automatic pipeline, which builds a caption from question-derived keywords and passes that caption to a language model, yields precise answers without fine-tuning.

Vision Language Models excel in handling a wide range of complex tasks, including Optical Character Recognition (OCR), Visual Question Answering (VQA), and advanced geometric reasoning. However, these models fail to perform well on low-level basic visual tasks which are especially easy for humans. Our goal in this work was to determine if these models are truly "blind" to geometric reasoning or if there are ways to enhance their capabilities in this area. Our work presents a novel automatic pipeline designed to extract key information from images in response to specific questions. Instead of just relying on direct VQA, we use question-derived keywords to create a caption that highlights important details in the image related to the question. This caption is then used by a language model to provide a precise answer to the question without requiring external fine-tuning.
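
The flow described in the abstract (question keywords, then a keyword-focused caption, then a language model answering from that caption) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stopword-based `extract_keywords`, the prompt wording, and the `captioner` and `language_model` callables are all hypothetical stand-ins for whatever models and prompts the paper actually uses.

```python
import re
from typing import Any, Callable, List

# Hypothetical stopword list; the paper does not specify how keywords are extracted.
STOPWORDS = {
    "a", "an", "the", "is", "are", "do", "does", "in", "on", "of", "to",
    "how", "many", "what", "which", "there", "this", "that", "and", "or",
}


def extract_keywords(question: str) -> List[str]:
    """Return the distinct non-stopword tokens of the question, in order."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    keywords: List[str] = []
    for token in tokens:
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def build_caption_prompt(keywords: List[str]) -> str:
    """Build a captioning prompt that steers the caption toward the keywords."""
    return "Describe the image, focusing on: " + ", ".join(keywords) + "."


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any, str], str],   # assumed: (image, prompt) -> caption
    language_model: Callable[[str], str],   # assumed: prompt -> answer text
) -> str:
    """Answer a question from a keyword-guided caption instead of direct VQA."""
    keywords = extract_keywords(question)
    caption = captioner(image, build_caption_prompt(keywords))
    qa_prompt = f"Caption: {caption}\nQuestion: {question}\nAnswer:"
    return language_model(qa_prompt)
```

In use, `captioner` would wrap a vision-language model and `language_model` a text-only model; neither is fine-tuned in this sketch, which mirrors the abstract's claim that no external fine-tuning is required.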
