Data Metabolism: An Efficient Data Design Schema For Vision Language Model
This work addresses the challenge of efficiently building high-performing VLMs for multimodal AI applications. Its contribution appears incremental, since it builds on existing model architectures and focuses on data design.
The paper tackles data curation for training Visual Language Models (VLMs) by introducing a data-centric framework called Data Metabolism, built around two steps: data curation and iteration. It demonstrates the framework with Capybara-VL, a compact VLM that surpasses several open-source models up to 10 times larger and performs on par with several leading proprietary models on multimodal tasks.
Data curation plays a crucial role in training powerful Visual Language Models (VLMs). In this work, we introduce the concept of Data Metabolism and present our data-centric framework for building VLMs throughout the development lifecycle. Starting from a standard model architecture, we discuss and provide insights into two crucial development steps, data curation and iteration, which together form a closed-loop system that continuously improves model performance. We provide a detailed codebook on how to process existing massive datasets and build a user-specific data flywheel. As a demonstration, we release a VLM named Capybara-VL, which excels in typical multimodal tasks (e.g., visual question answering, scientific reasoning, and text-rich tasks). Despite its relatively compact size, Capybara-VL surpasses several open-source models that are up to 10 times larger. Moreover, it achieves results on par with those of several leading proprietary models, demonstrating its competitiveness. These results highlight the power of our data-centric framework and the potential of training smaller and more efficient VLMs.