CV · Nov 3, 2023 · Code
VQPy: An Object-Oriented Approach to Modern Video Analytics
Shan Yu, Zhenting Zhu, Yu Chen et al.
Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., humans, animals, cars), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend (a Python variant with constructs that make it easy for users to express video objects and their interactions) as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized at Cisco as part of its DeepVision framework.
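To make the object-oriented framing concrete, here is a minimal sketch in plain Python. It is not VQPy's actual API: the names `Detection`, `VideoObject`, `Vehicle`, and `speeding_vehicles` are hypothetical, and the detector and tracker are assumed to have already produced per-frame detections with stable track ids.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detector output; all field names here are hypothetical."""
    track_id: int
    label: str
    frame: int
    center: tuple  # (x, y) in pixels


class VideoObject:
    """Base class: a tracked object that accumulates its detections."""
    label = None

    def __init__(self, track_id):
        self.track_id = track_id
        self.history = []

    def observe(self, det):
        self.history.append(det)


class Vehicle(VideoObject):
    """A video object with a derived property, as in an OO class hierarchy."""
    label = "car"

    def speed(self):
        """Pixels per frame between the two most recent observations."""
        if len(self.history) < 2:
            return 0.0
        prev, last = self.history[-2], self.history[-1]
        frames = last.frame - prev.frame
        if frames <= 0:
            return 0.0
        return math.dist(prev.center, last.center) / frames


def speeding_vehicles(detections, threshold):
    """Return sorted track ids of vehicles whose latest speed exceeds threshold."""
    vehicles = {}
    for det in sorted(detections, key=lambda d: d.frame):
        if det.label != Vehicle.label:
            continue  # the query only concerns Vehicle objects
        if det.track_id not in vehicles:
            vehicles[det.track_id] = Vehicle(det.track_id)
        vehicles[det.track_id].observe(det)
    return sorted(tid for tid, v in vehicles.items() if v.speed() > threshold)
```

The point of the sketch is only that a query becomes a predicate over typed objects and their properties, which is the kind of structure a backend could plausibly use to decide which models to run and when.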
MA · Apr 28
Pythia: Toward Predictability-Driven Agent-Native LLM Serving
Shan Yu, Junyi Shu, Yuanjiang Ni et al.
As LLM applications grow more complex, developers are increasingly adopting multi-agent architectures to decompose workflows into specialized, collaborative components, introducing structure that constrains agent behavior and exposes useful semantic predictability. Unlike traditional LLM serving, which operates under highly dynamic and uncertain conditions, this structured topology creates opportunities to reduce runtime uncertainty, yet existing systems fail to exploit it, treating agentic workloads as generic traffic and incurring significant inefficiencies. Our analysis of production traces from an agent-serving platform and an internal coding assistant reveals key bottlenecks, including low prefix cache hit rates, severe resource contention from long-context requests, and substantial queuing delays due to suboptimal scaling. To address these challenges, we propose Pythia, a multi-agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state-of-the-art baselines.
DC · May 6, 2025
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
Shan Yu, Jiarong Xing, Yifan Qiao et al.
Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and challenges for this task. The long-tail popularity of models and their long idle periods present opportunities to improve utilization through GPU sharing. However, existing GPU sharing systems lack the ability to adjust their resource allocation and sharing policies at runtime, making them ineffective at meeting latency service-level objectives (SLOs) under rapidly fluctuating workloads. This paper presents Prism, a multi-LLM serving system that unleashes the full potential of GPU sharing to achieve both cost efficiency and SLO attainment. At its core, Prism tackles a key limitation of existing systems: the lack of cross-model memory coordination, which is essential for flexibly sharing GPU memory across models under dynamic workloads. Prism achieves this with two key designs. First, it supports on-demand memory allocation by dynamically mapping physical to virtual memory pages, allowing flexible memory redistribution among models that space- and time-share a GPU. Second, it improves memory efficiency through a two-level scheduling policy that dynamically adjusts sharing strategies based on models' runtime demands. Evaluations on real-world traces show that Prism achieves more than 2× cost savings and 3.3× SLO attainment compared to state-of-the-art systems.
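The first design, on-demand mapping of physical pages into per-model virtual ranges, can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Prism's implementation: `SharedPagePool` and its methods are hypothetical names, pages are plain integers, and a real system would rely on the GPU driver's virtual memory facilities rather than Python dictionaries.

```python
class SharedPagePool:
    """Toy model of one GPU's physical pages shared by several models.

    Each model owns a private virtual page table; physical pages are
    attached only when a virtual page is first touched, and returned to
    the shared pool when the model goes idle.
    """

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # unassigned physical pages
        self.tables = {}                    # model -> {virtual: physical}

    def map_page(self, model, vpage):
        """Back a virtual page with a physical page on first use."""
        table = self.tables.setdefault(model, {})
        if vpage in table:
            return table[vpage]             # already mapped
        if not self.free:
            raise MemoryError("no free physical pages")
        table[vpage] = self.free.pop()
        return table[vpage]

    def unmap_all(self, model):
        """Release an idle model's pages so other models can map them."""
        table = self.tables.pop(model, {})
        self.free.extend(table.values())
        return len(table)

    def used(self, model):
        return len(self.tables.get(model, {}))
```

For example, with a four-page pool, a busy model can hold three pages while a second model holds one; once the first model goes idle and is unmapped, the second can grow into the freed pages without either model's virtual layout changing.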
CR · Jun 20, 2020
Securing Smart Home Edge Devices against Compromised Cloud Servers
Rahmadi Trimananda, Ali Younis, Thomas Kwa et al.
Smart home IoT systems often rely on cloud-based servers for communication between components. Although there exists a body of work on IoT security, most of it focuses on securing clients (i.e., IoT devices). However, cloud servers can also be compromised, and existing approaches do not typically protect smart home systems against compromised cloud servers. This paper presents FIDELIUS: a runtime system for secure cloud-based storage and communication even in the presence of compromised servers. FIDELIUS's design is tailored for smart home systems that have intermittent Internet access. In particular, it supports local control of smart home devices in the event that communication with the cloud is lost, and provides a consistency model using transactions to mitigate inconsistencies that can arise due to network partitions. We have implemented FIDELIUS, developed a smart home benchmark that uses FIDELIUS, and measured FIDELIUS's performance and power consumption. Our experiments show that compared to the commercial Particle.io framework, FIDELIUS reduces data communication time by more than 50% and increases battery life by 2X. Compared to PyORAM, an alternative (ORAM-based) oblivious storage implementation, FIDELIUS has 4-7X faster access times with 25-43X less data transferred.