Rodrigo Fonseca

DC
h-index47
17papers
763citations
Novelty57%
AI Score57

17 Papers

DCMay 6
A Performance Analyzer for a Public Cloud's ML-Augmented VM Allocator

Roozbeh Bostandoost, Pooria Namyar, Siva Kesava Reddy Kakarla et al.

Cloud operators increasingly deploy multiple ML models in their VM allocation pipelines. In such settings, individually benign predictions can shift and compound, severely degrading performance. In a cloud provider's VM placement pipeline, CPU, memory, and lifetime prediction models jointly determine server count, live migration frequency, and network utilization; yet no existing approach can systematically stress-test how these models adversely interact. Deterministic adversarial analyzers cannot capture probabilistic ML behavior, so operators miss failures that arise only from correlated distributional shifts across models In SANJESH, we formulate a bi-level optimization that captures how the ML models behave statistically and uncovers how they adversely interact. The outer level searches over what predictions the ML models could produce under distributional uncertainty to find adversarial conditions; the inner level evaluates how the VM allocator behaves given those predictions. When we applied it to the operator's production traces, SANJESH uncovered scenarios that cause $4\times$ worse performance than the operators' evaluator detected.

CLSep 11, 2023
PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis

Dylan Zhang, Xuchao Zhang, Chetan Bansal et al.

Major cloud providers have employed advanced AI-based solutions like large language models to aid humans in identifying the root causes of cloud incidents. Despite the growing prevalence of AI-driven assistants in the root cause analysis process, their effectiveness in assisting on-call engineers is constrained by low accuracy due to the intrinsic difficulty of the task, a propensity for LLM-based approaches to hallucinate, and difficulties in distinguishing these well-disguised hallucinations. To address this challenge, we propose to perform confidence estimation for the predictions to help on-call engineers make decisions on whether to adopt the model prediction. Considering the black-box nature of many LLM-based root cause predictors, fine-tuning or temperature-scaling-based approaches are inapplicable. We therefore design an innovative confidence estimation framework based on prompting retrieval-augmented large language models (LLMs) that demand a minimal amount of information from the root cause predictor. This approach consists of two scoring phases: the LLM-based confidence estimator first evaluates its confidence in making judgments in the face of the current incident that reflects its ``grounded-ness" level in reference data, then rates the root cause prediction based on historical references. An optimization step combines these two scores for a final confidence assignment. We show that our method is able to produce calibrated confidence estimates for predicted root causes, validate the usefulness of retrieved historical data and the prompting strategy as well as the generalizability across different root cause prediction models. Our study takes an important move towards reliably and effectively embedding LLMs into cloud incident management systems.

ARApr 3
Octopus: Enhancing CXL Memory Pods via Sparse Topology

Yuhong Zhong, Fiodar Kazhamiaka, Pantea Zardoshti et al.

The Compute Express Link (CXL) interconnect enables compute "pods" that pool memory across servers to reduce cost and improve efficiency. These pods also facilitate pairwise communication whose needs conflict with pooling. Importantly, existing pod designs are small or require indirection through expensive switches. These conventional designs implicitly assume that pods must fully connect all servers to all CXL pooling devices. This paper breaks with this conventional wisdom by introducing Octopus pods. Octopus directly connects servers to low-port-count CXL pooling devices (e.g., 4 ports) yet scales to large pods without switches by constructing a sparse CXL topology in which each pooling device connects to a carefully chosen subset of servers. Octopus explicitly balances "overlap", where two servers connect to the same pooling device: overlap reduces pooling efficiency but enables low-latency communication. Octopus resolves this tension by grouping servers into "islands" with low-latency intra-island communication and interconnecting islands to favor pooling. We build a three-server CXL pod prototype and simulate scaled pods with 96 servers under measured device characteristics and physical constraints (1.5 m copper cables). On hardware, Octopus RPCs are 3.2x faster than in-rack RDMA and 2.4x faster than CXL switches. In simulation, Octopus achieves net server cost savings of 3-5.4% whereas CXL switches result in a net cost increase.

OSApr 13
Nanvix: A Multikernel OS Design for High-Density Serverless Deployments

Carlos Segarra, Pedro Henrique Penna, Enrique Saurez et al.

Serverless providers strive for high resource utilization by optimizing deployment density: how many applications can be deployed per host server. However, achieving high deployment density without compromising application performance or isolation remains an open challenge. High density can be achieved by sharing components across applications, yet applications from different tenants must be strongly isolated from each other due to the risk of side-channel attacks. Sharing components across applications from the same tenant, if done naively, can introduce contention on host resources thus negatively affecting application performance. We describe Nanvix, a new multikernel OS that disaggregates ephemeral execution state, unique per application invocation, from long-lived persistent state, shared among invocations from the same tenant. Applications in Nanvix execute inside a lightweight user VM running a micro-kernel that implements threads and memory, and forwards all I/O requests to a system VM. The system VM runs a macro-kernel with a rich set of device drivers and is shared among all invocations from the same tenant. Nanvix' split design achieves strong hypervisor isolation across tenants without sacrificing application performance, and reduces same-tenant contention by multiplexing all I/O requests to the system VM. Thanks to a system-wide co-design, Nanvix achieves order-of-magnitude lower application start up times with moderate I/O overheads. When replaying a production trace, Nanvix needs 20-100x fewer host servers compared to state-of-the-art systems, improving deployment density

DCMar 6
StreamWise: Serving Multi-Modal Generation in Real-Time at Scale

Haoran Qiu, Gohar Irfan Chaudhry, Chaojie Zhang et al.

Advances in multi-modal generative models are enabling new applications, from storytelling to automated media synthesis. Most current workloads generate simple outputs (e.g., image generation from a prompt) in batch mode, often requiring several seconds even for basic results. Serving real-time multi-modal workflows at scale is costly and complex, requiring efficient coordination of diverse models (each with unique resource needs) across language, audio, image, and video, all under strict latency and resource constraints. We tackle these challenges through the lens of real-time podcast video generation, integrating LLMs, text-to-speech, and video-audio generation. To meet tight SLOs, we design an adaptive, modular serving system, StreamWise, that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling. We leverage heterogeneous hardware to maximize responsiveness and efficiency. For example, the system can lower video resolution and allocate more resources to early scenes. We quantify the trade-offs between latency, cost, and quality. The cheapest setup generates a 10-minute podcast video on A100 GPUs in 1.4 hours (8.4x slower than the real-time) for less than \$25. StreamWise enables high-quality real-time streaming with a sub-second startup delay under $45.

DCFeb 2, 2025Code
ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving

Haoran Qiu, Anish Biswas, Zihan Zhao et al.

Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text. However, efficiently serving LMMs in production environments poses significant challenges due to their complex architectures and heterogeneous characteristics across their multi-stage inference pipelines. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, across six representative open-source models, revealing key systems design implications. We also present an in-depth analysis of production LMM inference traces, uncovering unique workload characteristics, including variable, heavy-tailed request distributions and bursty traffic patterns. Based on these insights, we propose ModServe, a modular LMM serving system that decouples stages for independent optimization and adaptive scaling. ModServe dynamically reconfigures stages and handles bursty traffic with modality-aware scheduling and autoscaling to meet tail latency SLOs while minimizing costs. ModServe achieves 3.3-5.5x higher throughput (leading to 25-41.3% cost saving) while meeting SLOs on a 128-GPU cluster with production traces.

LGJun 17, 2019Code
Sample-Efficient Neural Architecture Search by Learning Action Space

Linnan Wang, Saining Xie, Teng Li et al.

Neural Architecture Search (NAS) has emerged as a promising technique for automatic neural network design. However, existing MCTS based NAS approaches often utilize manually designed action space, which is not directly related to the performance metric to be optimized (e.g., accuracy), leading to sample-inefficient explorations of architectures. To improve the sample efficiency, this paper proposes Latent Action Neural Architecture Search (LaNAS), which learns actions to recursively partition the search space into good or bad regions that contain networks with similar performance metrics. During the search phase, as different action sequences lead to regions with different performance, the search efficiency can be significantly improved by biasing towards the good regions. On three NAS tasks, empirical results demonstrate that LaNAS is at least an order more sample efficient than baseline methods including evolutionary algorithms, Bayesian optimizations, and random search. When applied in practice, both one-shot and regular LaNAS consistently outperform existing results. Particularly, LaNAS achieves 99.0% accuracy on CIFAR-10 and 80.8% top1 accuracy at 600 MFLOPS on ImageNet in only 800 samples, significantly outperforming AmoebaNet with 33x fewer samples. Our code is publicly available at https://github.com/facebookresearch/LaMCTS.

SEMar 7, 2024
Exploring LLM-based Agents for Root Cause Analysis

Devjeet Roy, Xuchao Zhang, Rashi Bhave et al.

The growing complexity of cloud based software systems has resulted in incident management becoming an integral part of the software development lifecycle. Root cause analysis (RCA), a critical part of the incident management process, is a demanding task for on-call engineers, requiring deep domain knowledge and extensive experience with a team's specific services. Automation of RCA can result in significant savings of time, and ease the burden of incident management on on-call engineers. Recently, researchers have utilized Large Language Models (LLMs) to perform RCA, and have demonstrated promising results. However, these approaches are not able to dynamically collect additional diagnostic information such as incident related logs, metrics or databases, severely restricting their ability to diagnose root causes. In this work, we explore the use of LLM based agents for RCA to address this limitation. We present a thorough empirical evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft. Results show that ReAct performs competitively with strong retrieval and reasoning baselines, but with highly increased factual accuracy. We then extend this evaluation by incorporating discussions associated with incident reports as additional inputs for the models, which surprisingly does not yield significant performance improvements. Lastly, we conduct a case study with a team at Microsoft to equip the ReAct agent with tools that give it access to external diagnostic services that are used by the team for manual RCA. Our results show how agents can overcome the limitations of prior work, and practical considerations for implementing such a system in practice.

DCJan 5, 2025
TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri et al.

The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques often are inadequate for LLM inference due to the fine-grained, millisecond-scale execution phases, each with distinct performance, thermal, and power profiles. Additionally, LLM inference workloads are sensitive to various configuration parameters (e.g., model parallelism, size, and quantization) that involve trade-offs between performance, temperature, power, and output quality. Moreover, clouds often co-locate SaaS and IaaS workloads, each with different levels of visibility and flexibility. We propose TAPAS, a thermal- and power-aware framework designed for LLM inference clusters in the cloud. TAPAS enhances cooling and power oversubscription capabilities, reducing the total cost of ownership (TCO) while effectively handling emergencies (e.g., cooling and power failures). The system leverages historical temperature and power data, along with the adaptability of SaaS workloads, to: (1) efficiently place new GPU workload VMs within cooling and power constraints, (2) route LLM inference requests across SaaS VMs, and (3) reconfigure SaaS VMs to manage load spikes and emergency situations. Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.

DCJan 28, 2025
Towards Resource-Efficient Compound AI Systems

Gohar Irfan Chaudhry, Esha Choukse, Íñigo Goiri et al.

Compound AI Systems, integrating multiple interacting components like models, retrievers, and external tools, have emerged as essential for addressing complex AI tasks. However, current implementations suffer from inefficient resource utilization due to tight coupling between application logic and execution details, a disconnect between orchestration and resource management layers, and the perceived exclusiveness between efficiency and quality. We propose a vision for resource-efficient Compound AI Systems through a declarative workflow programming model and an adaptive runtime system for dynamic scheduling and resource-aware decision-making. Decoupling application logic from low-level details exposes levers for the runtime to flexibly configure the execution environment and resources, without compromising on quality. Enabling collaboration between the workflow orchestration and cluster manager enables higher efficiency through better scheduling and resource management. We are building a prototype system, called Murakkab, to realize this vision. Our preliminary evaluation demonstrates speedups up to $\sim 3.4\times$ in workflow completion times while delivering $\sim 4.5\times$ higher energy efficiency, showing promise in optimizing resources and advancing AI system design.

MAAug 22, 2025
Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms

Gohar Irfan Chaudhry, Esha Choukse, Haoran Qiu et al.

Agentic workflows commonly coordinate multiple models and tools with complex control logic. They are quickly becoming the dominant paradigm for AI applications. However, serving them remains inefficient with today's frameworks. The key problem is that they expose workflows as opaque sequences of model and tool calls that tightly couple agent logic with model and hardware choices. Often, these workflow components are fragmented across different entities, preventing systems from reasoning about trade-offs across accuracy, latency, energy, and cost. This leads to resource waste and degraded service-level objectives (SLOs). We present Murakkab, a resource-efficient serving system for agentic workflows. Murakkab introduces a declarative abstraction that decouples workflow specification from execution configuration. A profile-guided optimizer and adaptive runtime jointly manage the full stack: orchestrating workflow components, mapping them to models and hardware, and dynamically reconfiguring execution to satisfy user-defined SLOs. By exposing the internal structure of agentic workflows, Murakkab enables cross-layer optimization that existing frameworks and cloud schedulers cannot achieve. Our evaluation on diverse workflows shows that Murakkab reduces GPU usage by up to 2.8$\times$, energy consumption by 3.7$\times$, and cost by 4.3$\times$ while maintaining SLOs.

AIFeb 20
WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

Madhav Kanda, Pedro Las-Casas, Alok Gautam Kumbhare et al.

LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult, because metric scores are often not calibrated, and score changes do not directly communicate the severity of workflow degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics. It works by applying realistic, controlled perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals. Our results characterize systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores. Our dataset will be released upon acceptance.

LGJul 1, 2020
Learning Search Space Partition for Black-box Optimization using Monte Carlo Tree Search

Linnan Wang, Rodrigo Fonseca, Yuandong Tian

High dimensional black-box optimization has broad applications but remains a challenging problem to solve. Given a set of samples $\{\vx_i, y_i\}$, building a global model (like Bayesian Optimization (BO)) suffers from the curse of dimensionality in the high-dimensional search space, while a greedy search may lead to sub-optimality. By recursively splitting the search space into regions with high/low function values, recent works like LaNAS shows good performance in Neural Architecture Search (NAS), reducing the sample complexity empirically. In this paper, we coin LA-MCTS that extends LaNAS to other domains. Unlike previous approaches, LA-MCTS learns the partition of the search space using a few samples and their function values in an online fashion. While LaNAS uses linear partition and performs uniform sampling in each region, our LA-MCTS adopts a nonlinear decision boundary and learns a local model to pick good candidates. If the nonlinear partition function and the local model fits well with ground-truth black-box function, then good partitions and candidates can be reached with much fewer samples. LA-MCTS serves as a \emph{meta-algorithm} by using existing black-box optimizers (e.g., BO, TuRBO) as its local models, achieving strong performance in general black-box optimization and reinforcement learning benchmarks, in particular for high-dimensional problems.

LGJun 11, 2020
Few-shot Neural Architecture Search

Yiyang Zhao, Linnan Wang, Yuandong Tian et al.

Efficient evaluation of a network architecture drawn from a large search space remains a key challenge in Neural Architecture Search (NAS). Vanilla NAS evaluates each architecture by training from scratch, which gives the true performance but is extremely time-consuming. Recently, one-shot NAS substantially reduces the computation cost by training only one supernetwork, a.k.a. supernet, to approximate the performance of every architecture in the search space via weight-sharing. However, the performance estimation can be very inaccurate due to the co-adaption among operations. In this paper, we propose few-shot NAS that uses multiple supernetworks, called sub-supernet, each covering different regions of the search space to alleviate the undesired co-adaption. Compared to one-shot NAS, few-shot NAS improves the accuracy of architecture evaluation with a small increase of evaluation cost. With only up to 7 sub-supernets, few-shot NAS establishes new SoTAs: on ImageNet, it finds models that reach 80.5% top-1 accuracy at 600 MB FLOPS and 77.5% top-1 accuracy at 238 MFLOPS; on CIFAR10, it reaches 98.72% top-1 accuracy without using extra data or transfer learning. In Auto-GAN, few-shot NAS outperforms the previously published results by up to 20%. Extensive experiments show that few-shot NAS significantly improves various one-shot methods, including 4 gradient-based and 6 search-based methods on 3 different tasks in NasBench-201 and NasBench1-shot-1.

CVMar 26, 2019
AlphaX: eXploring Neural Architectures with Deep Neural Networks and Monte Carlo Tree Search

Linnan Wang, Yiyang Zhao, Yuu Jinnai et al.

Neural Architecture Search (NAS) has shown great success in automating the design of neural networks, but the prohibitive amount of computations behind current NAS methods requires further investigations in improving the sample efficiency and the network evaluation cost to get better results in a shorter time. In this paper, we present a novel scalable Monte Carlo Tree Search (MCTS) based NAS agent, named AlphaX, to tackle these two aspects. AlphaX improves the search efficiency by adaptively balancing the exploration and exploitation at the state level, and by a Meta-Deep Neural Network (DNN) to predict network accuracies for biasing the search toward a promising region. To amortize the network evaluation cost, AlphaX accelerates MCTS rollouts with a distributed design and reduces the number of epochs in evaluating a network by transfer learning guided with the tree structure in MCTS. In 12 GPU days and 1000 samples, AlphaX found an architecture that reaches 97.84\% top-1 accuracy on CIFAR-10, and 75.5\% top-1 accuracy on ImageNet, exceeding SOTA NAS methods in both the accuracy and sampling efficiency. Particularly, we also evaluate AlphaX on NASBench-101, a large scale NAS dataset; AlphaX is 3x and 2.8x more sample efficient than Random Search and Regularized Evolution in finding the global optimum. Finally, we show the searched architecture improves a variety of vision applications from Neural Style Transfer, to Image Captioning and Object Detection.

CRJul 23, 2018
Scanning the Internet for ROS: A View of Security in Robotics Research

Nicholas DeMarinis, Stefanie Tellex, Vasileios Kemerlis et al.

Because robots can directly perceive and affect the physical world, security issues take on particular importance. In this paper, we describe the results of our work on scanning the entire IPv4 address space of the Internet for instances of the Robot Operating System (ROS), a widely used robotics platform for research. Our results identified that a number of hosts supporting ROS are exposed to the public Internet, thereby allowing anyone to access robotic sensors and actuators. As a proof of concept, and with consent, we were able to read image sensor information and move the robot of a research group in a US university. This paper gives an overview of our findings, including the geographic distribution of publicly-accessible platforms, the sorts of sensor and actuator data that is available, as well as the different kinds of robots and sensors that our scan uncovered. Additionally, we offer recommendations on best practices to mitigate these security issues in the future.

LGMay 18, 2018
Neural Architecture Search using Deep Neural Networks and Monte Carlo Tree Search

Linnan Wang, Yiyang Zhao, Yuu Jinnai et al.

Neural Architecture Search (NAS) has shown great success in automating the design of neural networks, but the prohibitive amount of computations behind current NAS methods requires further investigations in improving the sample efficiency and the network evaluation cost to get better results in a shorter time. In this paper, we present a novel scalable Monte Carlo Tree Search (MCTS) based NAS agent, named AlphaX, to tackle these two aspects. AlphaX improves the search efficiency by adaptively balancing the exploration and exploitation at the state level, and by a Meta-Deep Neural Network (DNN) to predict network accuracies for biasing the search toward a promising region. To amortize the network evaluation cost, AlphaX accelerates MCTS rollouts with a distributed design and reduces the number of epochs in evaluating a network by transfer learning, which is guided with the tree structure in MCTS. In 12 GPU days and 1000 samples, AlphaX found an architecture that reaches 97.84\% top-1 accuracy on CIFAR-10, and 75.5\% top-1 accuracy on ImageNet, exceeding SOTA NAS methods in both the accuracy and sampling efficiency. Particularly, we also evaluate AlphaX on NASBench-101, a large scale NAS dataset; AlphaX is 3x and 2.8x more sample efficient than Random Search and Regularized Evolution in finding the global optimum. Finally, we show the searched architecture improves a variety of vision applications from Neural Style Transfer, to Image Captioning and Object Detection.