NI May 5
SprayCheck: Finding Gray Failures in Adaptive Routing Networks
Jakob Krebs, Daniel Amir, Shir Landau Feibish et al.
Distributed machine learning (ML) training has become a dominant workload in modern data center networks, operating at massive scale with clusters comprising tens to hundreds of thousands of GPUs. The scale of these networks makes failures, and particularly gray failures, inevitable. Gray failures can significantly degrade both network and application performance, yet they are notoriously difficult to detect, localize, and debug. To meet the performance demands of ML workloads, adaptive routing is widely deployed to maximize network utilization by dynamically spreading traffic across many paths. While adaptive routing increases network utilization, it also greatly intensifies the effect of gray failures. Prior work has either dismissed gray failures as negligible or proposed detection mechanisms that fail to scale, rendering these approaches increasingly impractical for large-scale clusters. We present SprayCheck, a passive gray failure detection system that leverages the statistical properties of adaptive routing and network load balancing. By combining these properties with flow-level information, SprayCheck can identify failures before they significantly impact application performance, enabling preemptive rerouting and improving overall performance. Importantly, this is achieved through passive observation of traffic spraying, without introducing additional load on the network. We evaluate SprayCheck and show that it can detect and localize a single-link packet drop rate of $1.5\%$ within a single training iteration, and a rate as low as $0.5\%$ within 5 training iterations, of Llama-3 70B in a 64-spine topology.
NI Apr 26
NODE: Network Wide Top-K Flows in the Data Plane
Eitan Stein, Lior Zeno, Shir Landau Feibish
Monitoring network traffic is crucial for most network tasks, such as identifying and blocking attacks, pinpointing failures, and engineering and rerouting heavy traffic to maintain high throughput. One important metric when monitoring traffic is the set of top-k heavy flows, that is, the k heaviest flows in the traffic. Programmable networks allow advanced network analysis to be performed directly in the data plane. In recent years, various solutions have been proposed for efficiently finding the top-k heavy flows within a single switch. However, at times we may need to find the global top-k flows. Existing solutions for global top-k detection use a centralized controller that collects and aggregates the measurements performed in each of the switches. Yet the process of sending information to the control plane and then having the controller send the information back to the switches can be very lengthy. To detect and mitigate short-lived events, solutions that work completely within the data plane are needed. In this paper, we present NODE, a network-wide top-k detection algorithm that operates exclusively in the data plane. NODE allows the switches to aggregate information from all other switches in the network, and ensures that eventually all switches hold an identical global top-k table. We show that NODE detects global top-k flows on both synthetic and real traces, with a recall rate of over 95\% while using less than 300KB per switch.
CR Dec 8, 2016
Efficient Distinct Heavy Hitters for DNS DDoS Attack Detection
Yehuda Afek, Anat Bremler-Barr, Edith Cohen et al.
Motivated by a recent new type of randomized Distributed Denial of Service (DDoS) attack on the Domain Name Service (DNS), we develop novel and efficient distinct heavy hitters algorithms and build an attack identification system that uses our algorithms. Heavy hitter detection in streams is a fundamental problem with many applications, including detecting certain DDoS attacks and anomalies. A (classic) heavy hitter (HH) in a stream of elements is a key (e.g., the domain of a query) which appears in many elements (e.g., requests). When stream elements consist of <key; subkey> pairs (e.g., <domain; subdomain>), a distinct heavy hitter (dHH) is a key that is paired with a large number of different subkeys. Our dHH algorithms are considerably more practical than previous algorithms. Specifically, the new fixed-size algorithms are simple to code and have asymptotically optimal space-accuracy tradeoffs. In addition, we introduce a new measure, a combined heavy hitter (cHH), which is a key with a large combination of distinct and classic weights. Efficient algorithms are also presented for cHH detection. Finally, we perform an extensive experimental evaluation on real DNS attack traces, demonstrating the effectiveness of both our algorithms and our DNS malicious queries identification system.
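To make the difference between the classic and distinct weights defined above concrete, the following is a minimal sketch that computes both exactly over a small stream of (key, subkey) pairs. It is a naive illustration that stores every subkey, not the paper's fixed-size streaming algorithms; the function name `classic_and_distinct_weights` and the sample queries are hypothetical.

```python
from collections import Counter, defaultdict


def classic_and_distinct_weights(stream):
    """Exact (non-streaming) weights; illustrative only, not the paper's method.

    classic[key]  = number of elements in which the key appears
    distinct[key] = number of different subkeys paired with the key
    """
    classic = Counter()
    subkeys = defaultdict(set)
    for key, subkey in stream:
        classic[key] += 1
        subkeys[key].add(subkey)
    distinct = {key: len(seen) for key, seen in subkeys.items()}
    return classic, distinct


# Hypothetical DNS queries as (domain, subdomain) pairs.
queries = [
    ("example.com", "www"),
    ("example.com", "www"),
    ("example.com", "www"),
    ("example.com", "mail"),
    ("victim.org", "a1"),
    ("victim.org", "b2"),
    ("victim.org", "c3"),
]

classic, distinct = classic_and_distinct_weights(queries)
print(classic["example.com"], distinct["example.com"])
print(classic["victim.org"], distinct["victim.org"])
```

In this toy stream, example.com has the larger classic weight (4 requests, 2 distinct subdomains), while victim.org has the larger distinct weight (3 requests, 3 distinct subdomains), which is the pattern a randomized-subdomain attack would tend to produce. Exact counting like this needs memory proportional to the number of distinct pairs, which is the cost that fixed-size algorithms are meant to avoid.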