Large Multimodal Models: Notes on CVPR 2023 Tutorial
This note provides an overview for researchers interested in advancing large multimodal models, but its contribution is incremental: it summarizes existing work and reports no new results.
This tutorial note summarizes a presentation on building and surpassing multimodal GPT-4-like models, covering background on vision-and-language models, instruction-tuning basics, and prototyping with open-source resources.
This tutorial note summarizes the presentation on ``Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4'', part of the CVPR 2023 tutorial on ``Recent Advances in Vision Foundation Models''. The tutorial consists of three parts. We first introduce the background on recent GPT-like large models for vision-and-language modeling to motivate research on instruction-tuned large multimodal models (LMMs). As a prerequisite, we describe the basics of instruction tuning in large language models and then extend them to the multimodal space. Lastly, we illustrate how to build a minimal prototype of multimodal GPT-4-like models with open-source resources, and review recently emerged topics.