DCAI · Dec 23, 2024

Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models

arXiv:2501.14755v3 · 8 citations · h-index: 18
Originality: Incremental advance
AI Analysis

The paper addresses scalable data processing for foundation model developers, though it appears to be an incremental improvement over its predecessor.

The paper tackles the challenge of processing vast, multimodal datasets for foundation models by introducing Data-Juicer 2.0, a system that efficiently handles TB-level data with 10k+ CPU cores and supports over 100 data processing operators across text, image, video, and audio modalities.

Foundation models demand advanced data processing for their vast, multimodal datasets. However, traditional frameworks struggle with the unique complexities of multimodal data. In response, we present Data-Juicer 2.0, a data processing system backed by 100+ data processing operators spanning text, image, video, and audio modalities, supporting more critical tasks including data analysis, synthesis, annotation, and foundation model post-training. With seamless compatibility and dedicated optimization for popular dataset hubs like Hugging Face and computing engines like Ray, it improves upon its predecessor in terms of usability, efficiency, and programmability. It features an easily accessible user interface layer that supports decoupled Python interactions, RESTful APIs, and conversational commands. Its new runtime layer offers adaptive execution across diverse scales and environments, abstracting away system complexities. Extensive empirical evaluations demonstrate Data-Juicer 2.0's remarkable performance and scalability, highlighting its capability to efficiently process TB-level data with 10k+ CPU cores. The system is publicly available and has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI. We actively maintain the system and share practical insights to foster research and applications of next-generation foundation models.
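
The operator-based design described in the abstract can be illustrated with a minimal sketch. This is not Data-Juicer's actual API: the `Mapper` and `Filter` classes, `run_pipeline`, and the two example functions are hypothetical names chosen for illustration, and the sketch assumes samples are plain dictionaries with a `text` field. It only shows the general idea of chaining small, composable operators over a stream of samples.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

Sample = dict  # assumption: one record is a plain dict, e.g. {"text": "..."}


@dataclass
class Mapper:
    """Hypothetical operator that transforms each sample."""
    fn: Callable[[Sample], Sample]

    def run(self, samples: Iterable[Sample]) -> Iterator[Sample]:
        for sample in samples:
            yield self.fn(sample)


@dataclass
class Filter:
    """Hypothetical operator that keeps only samples passing a predicate."""
    keep: Callable[[Sample], bool]

    def run(self, samples: Iterable[Sample]) -> Iterator[Sample]:
        for sample in samples:
            if self.keep(sample):
                yield sample


def run_pipeline(samples, operators):
    """Chain the operators lazily, then materialize the result as a list."""
    for op in operators:
        samples = op.run(samples)
    return list(samples)


def strip_text(sample):
    # Return a copy of the sample with surrounding whitespace removed.
    return {**sample, "text": sample["text"].strip()}


def long_enough(sample):
    # Keep samples whose text has at least three whitespace-separated words.
    return len(sample["text"].split()) >= 3


pipeline = [Mapper(strip_text), Filter(long_enough)]
data = [{"text": "  foundation models need clean data  "}, {"text": "too short"}]
result = run_pipeline(data, pipeline)
```

Because each operator consumes and produces an iterator, the same pipeline definition could in principle be handed to a different executor (single process or distributed) without changing the operators themselves, which is the kind of separation the abstract attributes to the runtime layer.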

Code Implementations: 2 repos
