Garfield: System Support for Byzantine Machine Learning
This paper addresses the vulnerability of ML systems to Byzantine failures and offers developers a practical solution, though its contribution is incremental: it improves resilience within existing frameworks.
The paper tackles the problem of making machine learning applications Byzantine-resilient. It introduces Garfield, a library that reduces coding effort and addresses vulnerabilities in frameworks such as TensorFlow and PyTorch. Its evaluation shows that Byzantine resilience induces an accuracy loss and that the throughput overhead comes primarily from communication.
We present Garfield, a library to transparently make machine learning (ML) applications, initially built with popular (but fragile) frameworks, e.g., TensorFlow and PyTorch, Byzantine-resilient. Garfield relies on a novel object-oriented design, which reduces the coding effort and addresses the vulnerability of the shared-graph architecture followed by classical ML frameworks. Garfield encompasses various communication patterns and supports computations on CPUs and GPUs, which lets us address the general question of the practical cost of Byzantine resilience in SGD-based ML applications. We report on the usage of Garfield on three main ML architectures: (a) a single server with multiple workers, (b) several servers and workers, and (c) peer-to-peer settings. Using Garfield, we highlight several interesting facts about the cost of Byzantine resilience. In particular, (a) Byzantine resilience, unlike crash resilience, induces an accuracy loss, (b) the throughput overhead comes more from communication than from robust aggregation, and (c) tolerating Byzantine servers costs more than tolerating Byzantine workers.
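To make the notion of robust aggregation concrete, the following minimal sketch contrasts plain gradient averaging with a coordinate-wise median, one well-known Byzantine-resilient aggregation rule. It is an illustration only: the function names, the toy gradients, and the choice of the median rule are assumptions made here, not Garfield's API or the specific rules evaluated in the paper.

```python
from statistics import median

def average(gradients):
    """Plain averaging: one Byzantine gradient can shift the result arbitrarily."""
    n = len(gradients)
    return [sum(coords) / n for coords in zip(*gradients)]

def coordinate_wise_median(gradients):
    """Robust aggregation (illustrative): median of each coordinate across workers."""
    return [median(coords) for coords in zip(*gradients)]

# Hypothetical toy data: four honest workers and one Byzantine worker.
honest = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 1.9]]
byzantine = [[1000.0, -1000.0]]
gradients = honest + byzantine

avg = average(gradients)                 # dominated by the Byzantine gradient
med = coordinate_wise_median(gradients)  # [1.1, 1.9], close to the honest values
```

With a single server and several workers, the server would apply such a rule to the gradients it receives at each SGD step. The median needs an honest majority per coordinate, which is why the example uses four honest workers against one Byzantine worker.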