CV | Oct 14, 2025

MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites

Zhenxin Lei, Zhangwei Gao, Changyao Tian, Erfei Cui, Guanzhou Chen, Danni Yang, Yuchen Duan, Zhaokai Wang, Wenhao Li, Weiyun Wang, Xiangyu Zhao, Jiayi Ji

arXiv:2510.12126v3 | h-index: 26 | Has Code

Originality: Highly original

AI Analysis

CapFlow and MetaCaptioner provide a cost-effective solution for generalist visual captioning, benefiting multimodal research applications such as data synthesis.

The paper tackles the performance gap between open-source and commercial visual captioning models by introducing CapFlow, a multi-agent collaboration workflow that achieves caption quality comparable to GPT-4.1 at 89.5% lower cost. The authors then use CapFlow to synthesize training data for MetaCaptioner, which reaches top-tier performance among open-source models.

Generalist visual captioning goes beyond simple appearance description: it requires integrating a series of visual cues into a caption and handling various visual domains. On this task, current open-source models lag well behind commercial ones, which limits applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner, namely MetaCaptioner, via fine-tuning. Through extensive experiments, we show that MetaCaptioner not only achieves captioning capabilities comparable to those of commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.

