DC · AI · DB · LG · Feb 25

GetBatch: Distributed Multi-Object Retrieval for ML Data Loading

arXiv:2602.22434v1 · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses a data-loading bottleneck in distributed ML training pipelines. It is an incremental advance because it builds on existing object storage systems.

The paper tackles the high overhead of issuing thousands of individual GET requests when loading data for ML training. It introduces GetBatch, a new object store API that replaces independent GET operations with a single batch retrieval. The reported results are up to 15x higher throughput for small objects and, in a production training workload, a 2x reduction in P95 batch retrieval latency and a 3.7x reduction in P99 per-object tail latency.

Machine learning training pipelines consume data in batches. A single training step may require thousands of samples drawn from shards distributed across a storage cluster. Issuing thousands of individual GET requests incurs per-request overhead that often dominates data transfer time. To solve this problem, we introduce GetBatch, a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution. GetBatch achieves up to 15x throughput improvement for small objects and, in a production training workload, reduces P95 batch retrieval latency by 2x and P99 per-object tail latency by 3.7x compared to individual GET requests.
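To make the per-request overhead argument concrete, here is a minimal Python sketch. It is not the GetBatch API: `ToyObjectStore`, `get_batch`, and `estimated_seconds` are hypothetical names, the store is an in-memory dictionary, and the overhead and bandwidth figures are made-up parameters rather than measurements from the paper. It only illustrates why one batch request can cost far less than many individual GETs when objects are small.

```python
from typing import Dict, Iterable, Iterator, Tuple


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store; counts round trips."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)
        self.requests = 0  # number of client-to-store requests issued

    def get(self, key: str) -> bytes:
        """Individual GET: one request per object."""
        self.requests += 1
        return self._objects[key]

    def get_batch(self, keys: Iterable[str]) -> Iterator[Tuple[str, bytes]]:
        """Batch retrieval (assumed shape): one request for all keys,
        with results streamed back in the order the keys were given."""
        self.requests += 1
        ordered = list(keys)
        return ((key, self._objects[key]) for key in ordered)


def estimated_seconds(
    num_requests: int,
    total_bytes: int,
    per_request_overhead_s: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Simple cost model: fixed overhead per request plus transfer time."""
    return num_requests * per_request_overhead_s + total_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # 1,000 small objects of 1 KiB each (illustrative sizes only).
    store = ToyObjectStore({f"sample-{i}": b"x" * 1024 for i in range(1000)})
    keys = [f"sample-{i}" for i in range(1000)]
    total_bytes = 1000 * 1024

    for key in keys:
        store.get(key)
    individual_requests = store.requests

    store.requests = 0
    for _key, _data in store.get_batch(keys):
        pass
    batch_requests = store.requests

    # Assumed parameters: 1 ms overhead per request, 1 GB/s bandwidth.
    for label, n in (("individual", individual_requests), ("batch", batch_requests)):
        print(label, n, "requests,", estimated_seconds(n, total_bytes, 1e-3, 1e9), "s")
```

Under this model the transfer term is identical in both cases, so the difference comes entirely from the number of requests. This is why the benefit is expected to be largest for small objects. The sketch does not model the fault tolerance or the distributed execution described in the abstract.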

