CV · AI · MM · Sep 7, 2024

POINTS: Improving Your Vision-language Model with Affordable Strategies

arXiv:2409.04828v3 · 16 citations · h-index: 8 · Has Code
AI Analysis

This work addresses inefficiencies in training vision-language models for researchers and practitioners, though it is incremental as it builds on existing methods.

The paper tackles the lack of transparency and inefficient data usage in vision-language models with affordable strategies, including perplexity-based filtering of pre-training data and model soup during visual instruction tuning. The result is a 9B-parameter model that performs competitively with state-of-the-art models.

In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
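The two data and training strategies named in the abstract, keeping the lowest-perplexity pre-training samples and merging fine-tuned models with model soup, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes per-token log-probabilities have already been produced by some reference model, that `keep_fraction` is a free parameter (the abstract does not state the cutoff), and that the soup is a plain uniform average of weights, which is only one model-soup variant. The names `perplexity`, `select_lowest_perplexity`, and `uniform_soup` are hypothetical, and checkpoints are represented as plain dictionaries of float lists for clarity.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one sample: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_lowest_perplexity(
    samples: Dict[str, Sequence[float]], keep_fraction: float
) -> List[str]:
    """Return the ids of the lowest-perplexity samples.

    `samples` maps a sample id to its per-token log-probabilities
    (assumed to come from a reference model, not computed here).
    `keep_fraction` is an assumed hyperparameter in (0, 1].
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(samples, key=lambda sid: perplexity(samples[sid]))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def uniform_soup(
    checkpoints: Sequence[Dict[str, List[float]]]
) -> Dict[str, List[float]]:
    """Average parameters element-wise across checkpoints (uniform soup).

    Every checkpoint must have the same parameter names and shapes,
    for example models fine-tuned from one base on different datasets.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise ValueError("checkpoints have mismatched parameter names")
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in names
    }
```

In practice the checkpoints would be framework tensors rather than Python lists, and the soup members could be chosen greedily against a validation set instead of averaged uniformly. The sketch only shows the core arithmetic of each step.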

