Trever Schirmer

7.5 DC Apr 17

New Kids: An Architecture and Performance Investigation of Second-Generation Serverless Platforms

Trever Schirmer, Aris Wiegand, Lucca di Benedetto et al.

With the ever-increasing usage of serverless computing in both industry and academia, it is essential to understand the mechanisms that power the underlying platforms. As serverless is more than ten years old, there are different platforms with vastly different approaches. We show that, next to the traditional and popular platforms, a second generation of serverless platforms has emerged. While first-generation platforms are based on containerized, centralized execution, the new generation leverages lightweight isolates and edge deployment. This evolution reduces warm request latency from approximately 40 ms to around 10 ms and reduces cold starts to an afterthought, but limits the execution environment. In this paper, we gather and analyze all publicly available information to provide detailed insights into the underlying architecture of seven platforms and then run a microbenchmark-based evaluation totaling more than 38 million function calls to gain a deeper understanding of their performance.

57.2 DC Apr 29 Code

FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi et al.

Mixture-of-Experts (MoE) models offer high capacity at low inference cost by activating only a small subset of expert models per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the resources used by activated experts and the provisioned resources. This underutilization is even more pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert invocation across tenants. FaaSMoE further supports configurable expert granularity within functions, trading off per-expert elasticity for reduced invocation overhead. We implement a prototype with an open-source edge-oriented FaaS platform and evaluate it using Qwen1.5-moe-2.7B under multi-tenant workloads. Compared to a full-model baseline, FaaSMoE uses less than one third of the resources, demonstrating a practical and resource-efficient path towards scalable MoE serving in a multi-tenant environment.
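
The granularity trade-off mentioned in the abstract can be illustrated with a minimal sketch. It is not FaaSMoE's implementation: the function names, the top-k softmax gate, and the rule that expert i lives in function i // experts_per_function are all assumptions made for illustration only.

```python
import math


def top_k_experts(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def group_by_function(selected, experts_per_function):
    """Group selected experts by a hypothetical function id.

    Assumption: expert i is packed into function i // experts_per_function.
    A larger value means fewer invocations per token but coarser scaling.
    """
    calls = {}
    for idx, weight in selected:
        calls.setdefault(idx // experts_per_function, []).append((idx, weight))
    return calls


if __name__ == "__main__":
    logits = [0.1, 2.0, -1.0, 1.5, 0.3, 0.0, 0.2, 1.9]  # made-up gate scores
    chosen = top_k_experts(logits, k=2)
    for granularity in (1, 4, 8):
        calls = group_by_function(chosen, granularity)
        print(granularity, "experts per function:", len(calls), "invocation(s)")
```

With one expert per function, each selected expert is a separate invocation and can scale to zero independently; packing all eight experts into one function collapses the token to a single invocation at the cost of keeping idle experts loaded alongside the active ones.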


2 Papers