CL, CV | Aug 21, 2023

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

arXiv:2308.10755v3 | 48 citations | h-index: 30 | Has Code
Originality: Synthesis-oriented
AI Analysis

This dataset addresses data scarcity and the lack of transparency in training data for large models, which benefits researchers and developers in AI and ML. The contribution is incremental: it provides new data rather than a novel method.

The paper introduces WanJuan, a large-scale multimodal dataset of more than 2TB that covers text, image-text, and video modalities in Chinese and English. The data was collected from web sources to address the lack of open-source training data. It was used to train InternLM, which showed significant advantages over similar-scale models in multi-dimensional evaluations.

The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further developments within the community. As a response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.

Code Implementations: 1 repo