K. K. Ramakrishnan

DC
h-index9
4papers
12citations
Novelty53%
AI Score40

4 Papers

22.1NIMar 21
immUNITY: Detecting and Mitigating Low Volume & Slow Attacks with Programmable Switches and SmartNICs

Cuidi Wei, Shaoyu Tu, Daiki Hata et al.

Our analysis of recent Internet traces shows that up to 71% of flows contain suspicious behaviors indicative of low-volume network attacks such as port scans. However, distinguishing anomalous traffic in real time is challenging as each attack flow may comprise only a few packets. We extend prior work that tracks heavy hitter flows to also detect low-volume and slow attacks by combining the capabilities of both switches and SmartNICs. We flip the usual design approach by proposing an efficient filter data structure used to quickly route traffic marked as benign towards destination end-systems. We make careful use of limited programmable switch memory and pipeline stages, and complement them with SmartNIC resources to analyze the remaining traffic that may be anomalous. Using machine learning classifiers and intrusion detection rules deployed on the SmartNIC, we identify malicious source IPs, which then undergo more detailed forensics for attack mitigation. Finally, we develop a dataplane based protocol to rapidly coordinate data structure updates between these devices. We implement immUNITY in a testbed with Tofino v1 switch and Bluefield 3 SmartNIC, demonstrating its high accuracy, while minimizing traffic that's analyzed outside the switch.

DCMay 5, 2024
LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning

Shixiong Qi, K. K. Ramakrishnan, Myungjin Lee

Federated Learning (FL) typically involves a large-scale, distributed system with individual user devices/servers training models locally and then aggregating their model updates on a trusted central server. Existing systems for FL often use an always-on server for model aggregation, which can be inefficient in terms of resource utilization. They may also be inelastic in their resource management. This is particularly exacerbated when aggregating model updates at scale in a highly dynamic environment with varying numbers of heterogeneous user devices/servers. We present LIFL, a lightweight and elastic serverless cloud platform with fine-grained resource management for efficient FL aggregation at scale. LIFL is enhanced by a streamlined, event-driven serverless design that eliminates the individual heavy-weight message broker and replaces inefficient container-based sidecars with lightweight eBPF-based proxies. We leverage shared memory processing to achieve high-performance communication for hierarchical aggregation, which is commonly adopted to speed up FL aggregation at scale. We further introduce locality-aware placement in LIFL to maximize the benefits of shared memory processing. LIFL precisely scales and carefully reuses the resources for hierarchical aggregation to achieve the highest degree of parallelism while minimizing the aggregation time and resource consumption. Our experimental results show that LIFL achieves significant improvement in resource efficiency and aggregation speed for supporting FL at scale, compared to existing serverful and serverless FL systems.

DCJan 4
Making MoE based LLM inference resilient with Tarragon

Songyu Zhang, Aaron Tam, Myungjin Lee et al.

Mixture-of-Experts (MoE) models are increasingly used to serve LLMs at scale, but failures become common as deployment scale grows. Existing systems exhibit poor failure resilience: even a single worker failure triggers a coarse-grained, service-wide restart, discarding accumulated progress and halting the entire inference pipeline during recovery--an approach clearly ill-suited for latency-sensitive, LLM services. We present Tarragon, a resilient MoE inference framework that confines the failures impact to individual workers while allowing the rest of the pipeline to continue making forward progress. Tarragon exploits the natural separation between the attention and expert computation in MoE-based transformers, treating attention workers (AWs) and expert workers (EWs) as distinct failure domains. Tarragon introduces a reconfigurable datapath to mask failures by rerouting requests to healthy workers. On top of this datapath, Tarragon implements a self-healing mechanism that relaxes the tightly synchronized execution of existing MoE frameworks. For stateful AWs, Tarragon performs asynchronous, incremental KV cache checkpointing with per-request restoration, and for stateless EWs, it leverages residual GPU memory to deploy shadow experts. These together keep recovery cost and recomputation overhead extremely low. Our evaluation shows that, compared to state-of-the-art MegaScale-Infer, Tarragon reduces failure-induced stalls by 160-213x (from ~64 s down to 0.3-0.4 s) while preserving performance when no failures occur.

NEAug 8, 2020
Spatial Sharing of GPU for Autotuning DNN models

Aditya Dhakal, Junguk Cho, Sameer G. Kulkarni et al.

GPUs are used for training, inference, and tuning the machine learning models. However, Deep Neural Network (DNN) vary widely in their ability to exploit the full power of high-performance GPUs. Spatial sharing of GPU enables multiplexing several DNNs on the GPU and can improve GPU utilization, thus improving throughput and lowering latency. DNN models given just the right amount of GPU resources can still provide low inference latency, just as much as dedicating all of the GPU for their inference task. An approach to improve DNN inference is tuning of the DNN model. Autotuning frameworks find the optimal low-level implementation for a certain target device based on the trained machine learning model, thus reducing the DNN's inference latency and increasing inference throughput. We observe an interdependency between the tuned model and its inference latency. A DNN model tuned with specific GPU resources provides the best inference latency when inferred with close to the same amount of GPU resources. While a model tuned with the maximum amount of the GPU's resources has poorer inference latency once the GPU resources are limited for inference. On the other hand, a model tuned with an appropriate amount of GPU resources still achieves good inference latency across a wide range of GPU resource availability. We explore the causes that impact the tuning of a model at different amounts of GPU resources. We present many techniques to maximize resource utilization and improve tuning performance. We enable controlled spatial sharing of GPU to multiplex several tuning applications on the GPU. We scale the tuning server instances and shard the tuning model across multiple client instances for concurrent tuning of different operators of a model, achieving better GPU multiplexing. With our improvements, we decrease DNN autotuning time by up to 75 percent and increase throughput by a factor of 5.