LG AI

May 20, 2025

Collaborative Unlabeled Data Optimization

Xinyi Shang, Peng Sun, Fengyuan Liu, Tao Lin

arXiv:2505.14117v24.1h-index: 3

Originality Highly original

AI Analysis

The paper addresses a bottleneck facing deep learning practitioners: the limited reusability and scalability of knowledge in model-centric approaches.

The paper tackles inefficient deep learning training by optimizing unlabeled data so that knowledge is encoded directly in the data, reporting improvements of 13.6% and 6.8% on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of 1.94x and 1.2x.

This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying three key limitations in existing model-centric approaches, all rooted in a shared bottleneck: knowledge extracted from data is locked to model parameters, hindering its reusability and scalability. To this end, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization, thereby effectively encoding knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency, achieving 13.6% and 6.8% improvements on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of $1.94\times$ and $1.2\times$.
