Qingxiu Liu

2 Papers

65.4 AR May 5
Fletch: File-System Metadata Caching in Programmable Switches

Qingxiu Liu, Jiazhen Cai, Siyuan Sheng et al.

Fast and scalable metadata management across multiple metadata servers is crucial for distributed file systems to handle numerous files and directories. Client-side caching of frequently accessed metadata can mitigate server load, but maintaining cache consistency incurs significant overhead and complexity as the number of clients increases. We propose Fletch, an in-switch file-system metadata caching framework that leverages programmable switches to serve file-system metadata requests from multiple clients directly in the switch data plane. Unlike prior in-switch key-value caching approaches, which fail to address the path dependencies specific to file-system semantics, Fletch addresses these dependencies under stringent switch resource constraints. We implement Fletch atop Hadoop HDFS and evaluate it on a Tofino-switch testbed using real-world file-system metadata workloads. Fletch achieves up to 181.6% higher throughput than vanilla HDFS and complements client-side caching with throughput gains of up to 139.6%. It also incurs low latencies and limited switch resource usage.
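To illustrate what a path dependency means for a metadata cache, here is a minimal Python sketch. It is not Fletch's design: the class `PathAwareCache`, the helper `ancestors`, and the admission and eviction rules are all hypothetical, chosen only to show why a path such as `/a/b` cannot be treated as an independent key the way a key-value entry can.

```python
from typing import Optional


def ancestors(path: str) -> list[str]:
    """Return every proper ancestor of an absolute path, root first."""
    parts = [p for p in path.split("/") if p]
    if not parts:  # the root has no ancestors
        return []
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


class PathAwareCache:
    """Toy metadata cache (hypothetical) that keeps ancestor chains intact."""

    def __init__(self) -> None:
        self.entries: dict[str, dict] = {}

    def admit(self, path: str, meta: dict) -> bool:
        # Assumed rule: cache a path only if all of its ancestors are cached,
        # so a lookup never needs an uncached directory on the way down.
        if any(a not in self.entries for a in ancestors(path)):
            return False
        self.entries[path] = meta
        return True

    def lookup(self, path: str) -> Optional[dict]:
        return self.entries.get(path)

    def evict(self, path: str) -> None:
        # Assumed rule: evicting a directory also evicts everything below it.
        prefix = path.rstrip("/") + "/"
        for p in [p for p in self.entries if p == path or p.startswith(prefix)]:
            del self.entries[p]
```

In this sketch, admitting `/a/b` before `/` and `/a` is rejected, and evicting `/a` also drops `/a/b`. A plain key-value cache has no such coupling between keys.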

62.7 LG Apr 3
FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

Qingxiu Liu, Cyril Y. He, Hanser Jiang et al.

Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to enable efficient MoE inference under severe memory constraints. Experimental results demonstrate that FluxMoE achieves up to 3.0× throughput gains over vLLM in memory-intensive regimes, without compromising model fidelity.
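The "materialize on demand, evict immediately after use" idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not FluxMoE's implementation or any vLLM API: the class `ExpertPager`, its `host_store` dictionary standing in for off-GPU storage, and the dot-product "expert" are all hypothetical.

```python
class ExpertPager:
    """Toy expert pager (hypothetical): weights are resident only while in use."""

    def __init__(self, host_store: dict[int, list[float]]) -> None:
        self.host_store = host_store                 # expert id -> weights kept off-GPU
        self.resident: dict[int, list[float]] = {}   # stand-in for GPU memory
        self.peak_resident = 0                       # most experts resident at once

    def run(self, expert_id: int, x: float) -> float:
        # Materialize the expert's weights on demand.
        self.resident[expert_id] = self.host_store[expert_id]
        self.peak_resident = max(self.peak_resident, len(self.resident))
        try:
            # Placeholder computation standing in for the expert's forward pass.
            return sum(w * x for w in self.resident[expert_id])
        finally:
            # Evict immediately after use so the memory can hold other state.
            del self.resident[expert_id]


pager = ExpertPager({0: [1.0, 2.0], 1: [0.5]})
out0 = pager.run(0, 2.0)
out1 = pager.run(1, 4.0)
```

After both calls, `pager.resident` is empty and `pager.peak_resident` is 1: at no point does more than one expert occupy the simulated GPU memory, which is the property that frees capacity for runtime state such as the KV cache.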