On the Effect of Task-to-Worker Assignment in Distributed Computing Systems with Stragglers
This work addresses latency optimization in distributed systems and is aimed at researchers and engineers. Its contribution is incremental, since it builds on existing redundancy-based algorithms.
The paper tackles the problem of minimizing expected completion time in distributed computing systems with stragglers by analyzing task-to-worker assignment strategies, showing that uniform replication of non-overlapping batches achieves the minimum expected computing time.
We study the expected completion time of some recently proposed algorithms for distributed computing that redundantly assign computing tasks to multiple machines in order to tolerate a certain number of machine failures. We analytically show that not only the amount of redundancy but also the task-to-machine assignment affects the latency in a distributed system. We study systems with a fixed number of computing tasks that are split into possibly overlapping batches, and with independent, exponentially distributed machine service times. We show that, for such systems, the uniform replication of non-overlapping (disjoint) batches of computing tasks achieves the minimum expected computing time.
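To make the claim about uniform replication concrete, the following is a minimal sketch, not the paper's own derivation. It assumes that batches are disjoint, that each worker finishes its assigned batch after an independent Exp(mu) time regardless of batch contents, and that the job completes once every batch has been finished by at least one worker. The function name `expected_completion_time` and the parameters `replicas` and `mu` are hypothetical names introduced for illustration.

```python
from itertools import combinations


def expected_completion_time(replicas, mu=1.0):
    """Expected time until every batch is finished by at least one worker.

    replicas[i] is the number of workers assigned to disjoint batch i.
    Assumption: each worker finishes its batch after an independent
    Exp(mu) time, so only the replica counts matter.
    """
    if not replicas or any(r < 1 for r in replicas):
        raise ValueError("every batch needs at least one worker")
    # The minimum of r independent Exp(mu) variables is Exp(r * mu),
    # so batch i completes after an Exp(replicas[i] * mu) time.
    rates = [r * mu for r in replicas]
    # Inclusion-exclusion for the maximum of independent exponentials:
    # E[max_i X_i] = sum over non-empty subsets S of
    #                (-1)^(|S|+1) / sum_{i in S} rate_i
    total = 0.0
    for size in range(1, len(rates) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(rates, size):
            total += sign / sum(subset)
    return total


# Six workers and three disjoint batches in both cases.
uniform = expected_completion_time([2, 2, 2])  # two replicas per batch
skewed = expected_completion_time([1, 2, 3])   # unbalanced replication
```

With mu = 1, the uniform assignment gives 11/12 (about 0.917) and the unbalanced one gives 73/60 (about 1.217), which is consistent with the statement that uniform replication of disjoint batches minimizes the expected computing time under these assumptions. The subset enumeration grows exponentially with the number of batches, so this sketch is only suitable for small examples, and it does not cover overlapping batches.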