CL | Mar 20, 2023

Retrieving Multimodal Information for Augmented Generation: A Survey

arXiv:2303.10868v3 | 172 citations | h-index: 62
Originality: Synthesis-oriented
AI Analysis

The paper addresses the lack of a unified view of how to incorporate multimodality into LLMs. It is an incremental survey aimed at researchers in AI and natural language processing.

This survey reviews methods that augment large language models by retrieving multimodal knowledge in various formats to address concerns such as factuality and reasoning. It aims to give scholars a deeper understanding of these methods and to encourage their adaptation to the fast-growing LLM field.

As Large Language Models (LLMs) become popular, an important trend has emerged of using multimodality to augment LLMs' generation ability, which enables LLMs to better interact with the world. However, there is no unified view of at which stage and how to incorporate different modalities. In this survey, we review methods that assist and augment generative models by retrieving multimodal knowledge, in formats ranging from images, code, tables, and graphs to audio. Such methods offer a promising solution to important concerns such as factuality, reasoning, interpretability, and robustness. By providing an in-depth review, this survey is expected to give scholars a deeper understanding of the methods' applications and encourage them to adapt existing techniques to the fast-growing field of LLMs.

