CL | AI | CV | MM | Nov 10, 2023

How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model

arXiv:2311.07594v3 | 12 citations | h-index: 9
Originality: Synthesis-oriented
AI Analysis

The paper provides a systematic review of modality alignment methods for MLLMs. Its contribution is incremental: it categorizes existing approaches without introducing new techniques.

This paper surveys Multimodal Large Language Models (MLLMs) that integrate LLMs like GPT-4 to handle multimodal data such as text and images, addressing challenges like the semantic gap and modality alignment to improve capabilities like image captioning and question-answering.

We explore Multimodal Large Language Models (MLLMs), which integrate LLMs like GPT-4 to handle multimodal data, including text, images, audio, and more. MLLMs demonstrate capabilities such as generating image captions and answering image-based questions, bridging the gap towards real-world human-computer interactions and hinting at a potential pathway to artificial general intelligence. However, MLLMs still face challenges in addressing the semantic gap in multimodal data, which may lead to erroneous outputs, posing potential risks to society. Selecting the appropriate modality alignment method is crucial, as improper methods might require more parameters without significant performance improvements. This paper aims to explore modality alignment methods for LLMs and their current capabilities. Implementing effective modality alignment can help LLMs address environmental issues and enhance accessibility. The study surveys existing modality alignment methods for MLLMs, categorizing them into four groups: (1) Multimodal Converter, which transforms data into a format that LLMs can understand; (2) Multimodal Perceiver, which improves how LLMs perceive different types of data; (3) Tool Learning, which leverages external tools to convert data into a common format, usually text; and (4) Data-Driven Method, which teaches LLMs to understand specific data types within datasets.
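
To make the first category more concrete, here is a minimal sketch of one way a Multimodal Converter might work. The abstract does not specify a mechanism, so the sketch assumes a single linear projection that maps image features into the LLM's embedding space. The function names (`project`, `build_llm_input`), the toy dimensions, and the weight values are all hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(features: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Apply a linear map: one output value per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def build_llm_input(
    image_patches: List[Vector],
    text_embeddings: List[Vector],
    weights: Matrix,
    bias: Vector,
) -> List[Vector]:
    """Project each image patch into the text embedding space, then
    prepend the projected patches to the text token embeddings."""
    visual_tokens = [project(patch, weights, bias) for patch in image_patches]
    return visual_tokens + text_embeddings


# Toy sizes (assumed): 3-dim image features, 2-dim LLM embeddings.
WEIGHTS = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
BIAS = [0.0, 0.5]

patches = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]  # stand-in image features
text = [[0.1, 0.2]]                            # stand-in text token embedding

sequence = build_llm_input(patches, text, WEIGHTS, BIAS)
print(sequence)
```

In this toy setup the projected image patches and the text embeddings end up with the same dimensionality, so a language model could in principle consume them as one sequence. In a trained system the projection weights would be learned, not fixed by hand.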

