TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types
This addresses the challenge of poor generalization and bias in multimodal AI applications by automating task diversity enhancement, though it is incremental as it builds on existing fine-tuning methods.
The paper tackles the problem of insufficient task-specific data limiting multimodal visual language models by introducing TaskGalaxy, a large-scale dataset with 19,227 task types and 413,648 samples, which improves performance across 16 benchmarks when integrated into models like LLaVA-v1.5 and InternVL-Chat-v1.0.
Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy utilizes GPT-4o to enrich task diversity by expanding from a small set of manually defined tasks, with CLIP and GPT-4o filtering those that best match open-source images, and generating relevant question-answer pairs. Multiple models are employed to ensure sample quality. This automated process enhances both task diversity and data quality, reducing manual intervention. Incorporating TaskGalaxy into LLaVA-v1.5 and InternVL-Chat-v1.0 models shows substantial performance improvements across 16 benchmarks, demonstrating the critical importance of task diversity. TaskGalaxy is publicly released at https://github.com/Kwai-YuanQi/TaskGalaxy.