CV · CL · Mar 5

VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

arXiv:2603.04957v1
Originality: Incremental advance
AI Analysis

This work targets users who need fine-grained image descriptions from multimodal models, offering a compact and efficient way to generate detailed captions.

This paper introduces VisionPangu, a compact 1.7B-parameter multimodal model that improves detailed image captioning. It achieves this by combining an InternVL vision encoder with an OpenPangu-Embedded language backbone and using dense human-authored descriptions from the DOCCI dataset, resulting in more structured and detailed captions.

Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.
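The abstract describes a LLaVA-style layout: vision-encoder patch features pass through a lightweight MLP projector into the language model's embedding space. The sketch below is a minimal, dependency-free illustration of that idea, not the authors' implementation. The two-layer GELU design follows the projector used in LLaVA-1.5 and is an assumption here, since the abstract does not give the layer count. All dimensions and the names `MLPProjector` and `build_input_sequence` are hypothetical.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU, computed with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(rows, weight, bias):
    """Apply y = x @ W + b to each row; weight has shape in_dim x out_dim."""
    columns = list(zip(*weight))  # one tuple per output unit
    return [
        [sum(x * w for x, w in zip(row, col)) + b for col, b in zip(columns, bias)]
        for row in rows
    ]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (illustrative initialisation only)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    return weight, [0.0] * out_dim


class MLPProjector:
    """Hypothetical two-layer MLP mapping vision features to the text embedding width."""

    def __init__(self, vision_dim, text_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, text_dim, rng)
        self.w2, self.b2 = init_linear(text_dim, text_dim, rng)

    def __call__(self, patch_features):
        hidden = linear(patch_features, self.w1, self.b1)
        hidden = [[gelu(v) for v in row] for row in hidden]
        return linear(hidden, self.w2, self.b2)


def build_input_sequence(visual_tokens, text_embeddings):
    """Prepend projected visual tokens to the text token embeddings."""
    return visual_tokens + text_embeddings


# Toy sizes chosen for readability; real encoders and backbones are far wider.
VISION_DIM, TEXT_DIM, NUM_PATCHES, NUM_TEXT_TOKENS = 8, 16, 4, 3

rng = random.Random(1)
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)] for _ in range(NUM_TEXT_TOKENS)]

projector = MLPProjector(VISION_DIM, TEXT_DIM)
visual_tokens = projector(patch_features)
sequence = build_input_sequence(visual_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))  # 7 16
```

In this layout the projector is the only component that has to reconcile the two embedding widths, which is why it can stay small while the vision encoder and language backbone are reused largely as they are.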

