Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck
This work addresses a bottleneck in vision LLMs that matters to AI researchers, but it is incremental: it identifies and confirms an existing limitation without proposing a new solution.
The paper argues that large language models (LLMs) lack hierarchical knowledge of the visual world, which makes them a bottleneck for vision LLMs. The evidence comes from about one million four-choice VQA tasks built from six taxonomies and four image datasets. Finetuning a vision LLM on these tasks improved the LLM's hierarchical consistency more than the vision LLM's.
This paper reveals that many state-of-the-art large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual understanding (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs' bottleneck effect to some extent because the VQA tasks improve the LLM's hierarchical consistency more than the vision LLM's. We conjecture that one cannot make vision LLMs understand visual concepts in a fully hierarchical way until LLMs possess the corresponding taxonomy knowledge.
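The abstract does not define how hierarchical consistency is computed, so the following is only a minimal sketch under one plausible assumption: an image counts as consistent when the model answers correctly at every level of its taxonomy path, from leaf to root. The function names, the toy results, and the four-level path are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical per-image results. Each list holds one correctness flag per
# taxonomy level, ordered from the leaf (e.g. "Anemone Fish") up to the
# most general level (e.g. "Vertebrate", "Animal").
# These values are made up for illustration only.
results: Dict[str, List[bool]] = {
    "img_1": [True, True, False, True],   # leaf correct, a higher level missed
    "img_2": [True, True, True, True],    # correct at every level
    "img_3": [False, True, True, True],   # leaf missed, higher levels correct
    "img_4": [True, True, True, True],    # correct at every level
}


def leaf_accuracy(per_image: Dict[str, List[bool]]) -> float:
    """Fraction of images answered correctly at the leaf level only."""
    if not per_image:
        return 0.0
    return sum(levels[0] for levels in per_image.values()) / len(per_image)


def hierarchical_consistency(per_image: Dict[str, List[bool]]) -> float:
    """Fraction of images answered correctly at every taxonomy level.

    Assumed definition: one wrong answer anywhere on the path makes the
    whole image inconsistent.
    """
    if not per_image:
        return 0.0
    consistent = sum(all(levels) for levels in per_image.values())
    return consistent / len(per_image)


if __name__ == "__main__":
    print("leaf accuracy:", leaf_accuracy(results))
    print("hierarchical consistency:", hierarchical_consistency(results))
```

Under this assumed definition, a model can score well on leaf accuracy while scoring poorly on hierarchical consistency, which is the pattern the abstract's Anemone Fish versus Vertebrate example describes.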