LG · ML · Nov 1, 2019

Progressive Compressed Records: Taking a Byte out of Deep Learning Data

arXiv:1911.00472v4 · 11 citations
Originality: Highly original
AI Analysis

This addresses the data transfer bottleneck in deep learning training for users of commodity networks and storage, offering a practical solution with significant speed improvements.

The paper tackles the problem of deep learning training being bottlenecked by data transfer bandwidth. It introduces Progressive Compressed Records (PCRs), a data format that cuts the bandwidth used during training by up to 50%, potentially doubling training speed while maintaining accuracy.

Deep learning accelerators efficiently train over vast and growing amounts of data, placing a newfound burden on commodity networks and storage devices. A common approach to conserve bandwidth involves resizing or compressing data prior to training. We introduce Progressive Compressed Records (PCRs), a data format that uses compression to reduce the overhead of fetching and transporting data, effectively reducing the training time required to achieve a target accuracy. PCRs deviate from previous storage formats by combining progressive compression with an efficient storage layout to view a single dataset at multiple fidelities---all without adding to the total dataset size. We implement PCRs and evaluate them on a range of datasets, training tasks, and hardware architectures. Our work shows that: (i) the amount of compression a dataset can tolerate exceeds 50% of the original encoding for many DL training tasks; (ii) it is possible to automatically and efficiently select appropriate compression levels for a given task; and (iii) PCRs enable tasks to readily access compressed data at runtime---utilizing as little as half the training bandwidth and thus potentially doubling training speed.
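
The abstract describes combining progressive compression with a storage layout that exposes one dataset at multiple fidelities without growing it. A minimal sketch of one way such a layout could work is shown below. It assumes each image has already been split into ordered chunks ("scans", coarse to fine) and groups the chunks by level, so that reading a prefix of the record yields every image at a lower fidelity. The names `build_record`, `prefix_size`, and `read_at_fidelity` are hypothetical, the byte strings are placeholders rather than real compressed data, and this is an illustration rather than the authors' implementation.

```python
from typing import List, Tuple

Scans = List[bytes]  # one image: ordered chunks, coarse to fine (placeholder data)


def build_record(images: List[Scans]) -> Tuple[bytes, List[List[int]]]:
    """Store all level-0 chunks first, then all level-1 chunks, and so on."""
    n_levels = len(images[0])
    assert all(len(img) == n_levels for img in images)
    payload = bytearray()
    lengths: List[List[int]] = []  # lengths[level][i] = chunk size of image i
    for level in range(n_levels):
        lengths.append([len(img[level]) for img in images])
        for img in images:
            payload += img[level]
    return bytes(payload), lengths


def prefix_size(lengths: List[List[int]], fidelity: int) -> int:
    """Bytes that must be read to get every image at the given fidelity."""
    return sum(sum(level) for level in lengths[:fidelity])


def read_at_fidelity(record: bytes, lengths: List[List[int]], fidelity: int) -> List[bytes]:
    """Read one contiguous prefix and regroup its chunks per image."""
    prefix = record[:prefix_size(lengths, fidelity)]
    out = [bytearray() for _ in lengths[0]]
    pos = 0
    for level in lengths[:fidelity]:
        for i, size in enumerate(level):
            out[i] += prefix[pos:pos + size]
            pos += size
    return [bytes(b) for b in out]


if __name__ == "__main__":
    images = [[b"A0", b"A1a", b"A2"], [b"B0", b"B1", b"B2bb"]]
    record, lengths = build_record(images)
    # The record holds each chunk exactly once, so it is no larger than the input.
    print(len(record), prefix_size(lengths, 1))
    print(read_at_fidelity(record, lengths, 1))
    print(read_at_fidelity(record, lengths, 3))
```

In this toy layout, a lower fidelity costs a shorter sequential read of the same record, which is the kind of bandwidth saving the abstract refers to; choosing the fidelity for a given task is a separate step not shown here.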

Code Implementations: 1 repo