CV | Dec 21, 2023

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo

arXiv:2312.14238v365.33174 citations | h-index: 63 | Has Code | CVPR

Originality: Incremental advance

AI Analysis

This work addresses the need for scalable, aligned vision-language models to advance multimodal AGI systems. It is a significant but incremental improvement over existing models such as ViT-22B.

The authors tackle the lag of vision-language foundation models behind LLMs by developing InternVL, a vision-language foundation model whose vision component is scaled to 6 billion parameters. It achieves state-of-the-art performance on 32 generic visual-linguistic benchmarks, covering tasks such as zero-shot classification and retrieval.

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multimodal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks, including visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval. It can also be linked with LLMs to create multimodal dialogue systems. It has powerful visual capabilities and can be a good alternative to ViT-22B. We hope that our research can contribute to the development of multimodal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
