CV, CL, LG | Aug 17, 2023

Multimodal Analysis Of Google Bard And GPT-Vision: Experiments In Visual Reasoning

arXiv:2309.16705v2 | h-index: 13
AI Analysis

This work addresses the gap in understanding multimodal AI capabilities for researchers and developers, though it is incremental as it builds on existing models without introducing new methods.

The study tackled the problem of evaluating visual comprehension in large language models by testing Google Bard and GPT-Vision on 64 visual tasks, finding they excel at visual CAPTCHAs but struggle with tasks like recreating ASCII art or analyzing Tic Tac Toe grids.

Addressing the gap in understanding visual comprehension in Large Language Models (LLMs), we designed a challenge-response study, subjecting Google Bard and GPT-Vision to 64 visual tasks spanning categories like "Visual Situational Reasoning" and "Next Scene Prediction." Previous models, such as GPT-4, leaned heavily on optical character recognition tools like Tesseract, whereas Bard and GPT-Vision, akin to Google Lens and Visual API, employ deep learning techniques for visual text recognition. However, our findings spotlight the limitations of both vision-language models: while proficient in solving visual CAPTCHAs that stump ChatGPT alone, they falter in recreating visual elements like ASCII art or analyzing Tic Tac Toe grids, suggesting an over-reliance on educated visual guesses. The prediction problem based on visual inputs appears particularly challenging, with current "next-token" multimodal models producing no common-sense guesses for next-scene forecasting. This study provides experimental insights into the current capacities and areas for improvement in multimodal LLMs.
