LG AI Sep 4, 2023

Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD

Etay Livne, Gal Kaplun, Eran Malach, Shai Shalev-Shwartz

arXiv:2309.01640v12.0

Originality: Incremental advance

AI Analysis

This work addresses data-shuffling bottlenecks for machine learning practitioners who train models on large datasets. It is an incremental advance, since it builds directly on the earlier CorgiPile method.

The paper tackles the inefficiency of random data access in SGD for large cloud-stored datasets by introducing a hybrid offline-online shuffling strategy that combines an offline CorgiPile iteration with a subsequent online one, achieving model performance similar to SGD with random access without compromising CorgiPile's data access efficiency.

When using Stochastic Gradient Descent (SGD) for training machine learning models, it is often crucial to provide the model with examples sampled at random from the dataset. However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient. A recent work \cite{corgi} proposed an online shuffling algorithm called CorgiPile, which greatly improves the efficiency of data access at the cost of some performance loss. This loss is particularly apparent for large datasets stored in homogeneous shards (e.g., video datasets). In this paper, we introduce a novel two-step partial data shuffling strategy for SGD which combines an offline iteration of the CorgiPile method with a subsequent online iteration. Our approach enjoys the best of both worlds: it performs similarly to SGD with random access (even for homogeneous data) without compromising the data access efficiency of CorgiPile. We provide a comprehensive theoretical analysis of the convergence properties of our method and demonstrate its practical advantages through experimental results.
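The two-step strategy described in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the function names, the equal-block-size assumption, and the choice to visit blocks in one random permutation are all assumptions made here for illustration, and real shards would live in cloud storage rather than in Python lists.

```python
import random

def corgipile_epoch(blocks, buffer_blocks, rng):
    """Online step (sketch): read random groups of blocks into a buffer,
    shuffle each buffer, and yield its examples."""
    order = list(range(len(blocks)))
    rng.shuffle(order)  # block-level randomness: only whole blocks are fetched
    for start in range(0, len(order), buffer_blocks):
        buffer = [x for i in order[start:start + buffer_blocks] for x in blocks[i]]
        rng.shuffle(buffer)  # example-level randomness inside the buffer
        yield from buffer

def offline_corgi_shuffle(blocks, buffer_blocks, rng):
    """Offline step (sketch): same block-level mixing, but the shuffled buffer
    is written back as new blocks instead of being fed to SGD.
    Assumes all blocks have the same size (an assumption of this sketch)."""
    block_size = len(blocks[0])
    order = list(range(len(blocks)))
    rng.shuffle(order)
    new_blocks = []
    for start in range(0, len(order), buffer_blocks):
        buffer = [x for i in order[start:start + buffer_blocks] for x in blocks[i]]
        rng.shuffle(buffer)
        # Split the mixed buffer back into blocks of the original size.
        for j in range(0, len(buffer), block_size):
            new_blocks.append(buffer[j:j + block_size])
    return new_blocks

def corgi2_epoch(blocks, buffer_blocks, seed=0):
    """Hybrid sketch: one offline pass to break up homogeneous blocks,
    followed by one online CorgiPile-style pass over the rewritten blocks."""
    rng = random.Random(seed)
    mixed = offline_corgi_shuffle(blocks, buffer_blocks, rng)
    return list(corgipile_epoch(mixed, buffer_blocks, rng))
```

For example, with eight homogeneous blocks such as `[[(b, i) for i in range(4)] for b in range(8)]` and `buffer_blocks=2`, the offline pass produces blocks that each mix examples from two source blocks, so the online pass draws each buffer from up to four of the original blocks instead of two.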
