CV AI CL LG RO
Jan 4, 2025

A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges

Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, Guangyao Shi

arXiv:2501.02189v639.996 citations
h-index: 18
Has Code
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Originality: Synthesis-oriented

AI Analysis

The paper synthesizes existing knowledge for researchers and practitioners, but its contribution is incremental: it does not introduce new methods or results.

This paper provides a comprehensive survey of large vision-language models, covering their alignment methods, benchmarks, evaluations, and challenges such as hallucination and safety issues.

Multimodal Vision Language Models (VLMs) have emerged as a transformative topic at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and outperform classical single-modality vision models on zero-shot classification [93]. Given the rapid advances in VLM research and their growing popularity across applications, we provide a comprehensive survey of VLMs. Specifically, we give a systematic overview of the following aspects: (1) model information for the major VLMs developed up to 2025; (2) the evolution of VLM architectures and the newest VLM alignment methods; (3) a summary and categorization of popular VLM benchmarks and evaluation metrics; (4) the challenges and issues faced by current VLMs, such as hallucination, alignment, fairness, and safety. Detailed collections, including papers and model repository links, are available at https://github.com/zli12321/Vision-Language-Models-Overview.
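The zero-shot classification mentioned in the abstract can be illustrated with a minimal sketch of the CLIP-style recipe: embed the image and one text prompt per candidate label, then pick the label whose text embedding is most similar to the image embedding. The vectors below are hand-written toy values standing in for encoder outputs, and `zero_shot_classify` and `logit_scale` are hypothetical names chosen for illustration. Nothing here is taken from the survey or from any specific model's code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Return (best_label, probabilities) from cosine similarity plus softmax.

    logit_scale is an assumed sharpening factor applied before the softmax.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)

    # Numerically stable softmax over the candidate labels.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy embeddings (assumption): in practice these would come from a
# vision encoder and a text encoder applied to prompts such as
# "a photo of a cat".
labels = ["cat", "dog", "car"]
text_embs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
]
image_emb = [0.8, 0.3, 0.1]

prediction, probs = zero_shot_classify(image_emb, text_embs, labels)
print(prediction, [round(p, 4) for p in probs])
```

No classifier is trained on the target labels. Adding a class only requires adding another text prompt and its embedding, which is what makes the setup "zero-shot".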

