LG | DS | ML | Dec 15, 2015

Data Driven Resource Allocation for Distributed Learning

arXiv:1512.04848v2 | 9 citations
Originality: Incremental advance
AI Analysis

This work addresses resource allocation challenges in distributed learning for applications like image and advertising data, offering incremental improvements with provable guarantees and scalability.

The paper tackles the problem of data allocation in distributed machine learning by proposing data-dependent dispatching that leverages local simplicity in classification rules, achieving significantly higher accuracy on synthetic and real-world datasets compared to baselines like random partitioning and locality-sensitive hashing.

In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, classification rules of high accuracy tend to be "locally simple but globally complex" (Vapnik & Bottou 1993), we propose data-dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving that existing scalable heuristics perform well in natural non-worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality-sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real-world image and advertising datasets. We also demonstrate that our technique scales strongly with the available computing power.
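To make the idea of data-dependent dispatching concrete, here is a minimal Python sketch. It does not reproduce the paper's algorithms or their guarantees. It assumes a plain Lloyd's k-means fit on a small sample as a stand-in for the clustering step, and a greedy nearest-center rule with a per-machine capacity (for balancedness) and a replication factor (for fault tolerance) as a stand-in for the dispatching rule. The names `fit_centers`, `dispatch`, `capacity`, and `replicas` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centers(sample, k, iters=10, seed=0):
    """Plain Lloyd's k-means on a small sample (illustrative stand-in only)."""
    rng = random.Random(seed)
    centers = rng.sample(sample, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in sample:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(col) / len(group) for col in zip(*group))
    return centers


def dispatch(points, centers, capacity, replicas=1):
    """Greedily send each point to its `replicas` nearest machines with room.

    Machine i is identified with centers[i]. `capacity` caps the number of
    points per machine; `replicas` is the number of copies of each point.
    Returns one tuple of machine indices per input point.
    """
    k = len(centers)
    if replicas > k:
        raise ValueError("replicas cannot exceed the number of machines")
    if capacity * k < replicas * len(points):
        raise ValueError("total capacity is too small for the data")
    loads = [0] * k
    assignment = []
    for p in points:
        by_distance = sorted(range(k), key=lambda c: sq_dist(p, centers[c]))
        chosen = [c for c in by_distance if loads[c] < capacity][:replicas]
        if len(chosen) < replicas:
            # The greedy rule can get stuck; a real method would rebalance.
            raise RuntimeError("greedy dispatch ran out of room")
        for c in chosen:
            loads[c] += 1
        assignment.append(tuple(chosen))
    return assignment


# Usage: learn the rule on a subsample, then apply it to every point.
data = [(float(i % 5), float(i // 5)) for i in range(20)]
data += [(10.0 + i % 5, 10.0 + i // 5) for i in range(20)]
rule = fit_centers(data[::4], k=2)
machines = dispatch(data, rule, capacity=25, replicas=1)
```

Fitting on `data[::4]` and dispatching all of `data` mirrors, in a very loose way, the abstract's point about extending a rule from a small sample to the full distribution. The greedy capacity check is the simplest way to show the balance constraint; it carries no approximation guarantee.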
