CVAICLLGROJan 4, 2025

A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges

arXiv:2501.02189v696 citationsh-index: 18Has Code2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality Synthesis-oriented
AI Analysis

It synthesizes existing knowledge for researchers and practitioners, but is incremental as it does not introduce new methods or results.

This paper provides a comprehensive survey of large vision-language models, covering their alignment methods, benchmarks, evaluations, and challenges such as hallucination and safety issues.

Multimodal Vision Language Models (VLMs) have emerged as a transformative topic at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification [93]. With their rapid advancements in research and growing popularity in various applications, we provide a comprehensive survey of VLMs. Specifically, we provide a systematic overview of VLMs in the following aspects: [1] model information of the major VLMs developed up to 2025; [2] the transition of VLM architectures and the newest VLM alignment methods; [3] summary and categorization of the popular benchmarks and evaluation metrics of VLMs; [4] the challenges and issues faced by current VLMs such as hallucination, alignment, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Vision-Language-Models-Overview.

Code Implementations3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes