Weijia Song

DC
h-index: 3
5 papers
4 citations
Novelty: 58%
AI Score: 41

5 Papers

OS · Nov 29, 2023
Cascade: A Platform for Delay-Sensitive Edge Intelligence

Weijia Song, Thiago Garrett, Yuting Yang et al.

Interactive intelligent computing applications are increasingly prevalent, creating a need for AI/ML platforms optimized to reduce per-event latency while maintaining high throughput and efficient resource management. Yet many intelligent applications run on AI/ML platforms that optimize for high throughput even at the cost of high tail latency. Cascade is a new AI/ML hosting platform intended to untangle this puzzle. Innovations include a legacy-friendly storage layer that moves data with minimal copying and a "fast path" that collocates data and computation to maximize responsiveness. Our evaluation shows that Cascade reduces latency by orders of magnitude with no loss of throughput.

77.8 · AI · Mar 24
ABSTRAL: Automatic Design of Multi-Agent Systems Through Iterative Refinement and Topology Optimization

Weijia Song, Jiashu Yue, Zhe Pang

How should multi-agent systems be designed, and can that design knowledge be captured in a form that is inspectable, revisable, and transferable? We introduce ABSTRAL, a framework that treats MAS architecture as an evolving natural-language document, an artifact refined through contrastive trace analysis. Three findings emerge. First, we provide a precise measurement of the multi-agent coordination tax: under fixed turn budgets, ensembles achieve only 26% turn efficiency, with 66% of tasks exhausting the limit, yet still improve over single-agent baselines by discovering parallelizable task decompositions. Second, design knowledge encoded in documents transfers: topology reasoning and role templates learned on one domain provide a head start on new domains, with transferred seeds matching cold-start iteration-3 performance in a single iteration. Third, contrastive trace analysis discovers specialist roles absent from any initial design, a capability no prior system demonstrates. On SOPBench (134 bank tasks, deterministic oracle), ABSTRAL reaches 70% validation / 65.96% test pass rate with a GPT-4o backbone. We release the converged documents as inspectable design rationale.

DB · Nov 3, 2025
Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

Yuting Yang, Tiancheng Yuan, Jamal Hashim et al.

There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications and deployed as agents. Our central premise is that these latter cases will bring service-level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.

DC · Nov 30, 2023
Keep Your Friends Close: Leveraging Affinity Groups to Accelerate AI Inference Workflows

Thiago Garrett, Weijia Song, Roman Vitenberg et al.

AI inference workflows are typically structured as a pipeline or graph of AI programs triggered by events. As events occur, the AIs perform inference or classification tasks under time pressure to respond or take some action. Standard techniques that reduce latency in other streaming settings (such as caching and optimization-driven scheduling) are of limited value because AI data access patterns (models, databases) change depending on the triggering event: a significant departure from traditional streaming. In this work, we propose a novel affinity grouping mechanism that makes it easier for developers to express application-specific data access correlations, enabling coordinated management of data objects in server clusters hosting streaming inference tasks. Our proposals are thus complementary to other approaches such as caching and scheduling. Experiments confirm the limitations of standard techniques, while showing that the proposed mechanism is able to maintain significantly lower latency as workload and scale-out increase, and yet requires only minor code changes.

DC · Feb 27, 2024
Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows

Yuting Yang, Andrea Merlina, Weijia Song et al.

We consider ML query processing in distributed systems where GPU-enabled workers coordinate to execute complex queries: a computing style often seen in applications that interact with users in support of image processing and natural language processing. In such systems, coscheduling of GPU memory management and task placement represents a promising opportunity. We propose Compass, a novel framework that unifies these functions to reduce job latency while using resources efficiently, placing tasks where data dependencies will be satisfied, collocating tasks from the same job (when this will not overload the host or its GPU), and efficiently managing GPU memory. Comparison with other state-of-the-art schedulers shows a significant reduction in completion times while requiring the same amount or even fewer resources. In one case, just half the servers were needed for processing the same workload.