Hongjun Dai

3.9ARMay 21Code

FASE: FPGA-Assisted Syscall Emulation for Rapid End-to-End Processor Performance Validation

Chengzhen Meng, Xiuzhuang Chen, Bingcai Sui et al.

The rapid advancement of AI workloads and domain-specific architectures has led to increasingly diverse processor microarchitectures, whose design exploration requires fast and accurate performance validation. However, traditional workflows defer validation process until RTL design and SoC integration are complete, significantly prolonging development and iteration cycle. In this work, we present FASE framework, FPGA-Assisted Syscall Emulation, the first work for adapt syscall emulation on FPGA platforms, enabling complex multi-thread benchmarks to directly run on the processor design without integrating SoC or target OS for early-stage performance validation. FASE introduces three key innovations to address three critical challenges for adapting FPGA-based syscall emulation: (1) only a minimal CPU interface is exposed, with other hardware components untouched, addressing the lack of a unified hardware interface in FPGA systems; (2) a Host-Target Protocol (HTP) is proposed to minimize cross-device data traffic, mitigating the low-bandwidth and high-latency communication between FPGA and host; and (3) a host-side runtime is proposed to remotely handle Linux-style system calls, addressing the challenge of cross-device syscall delegation. Experiments ware conducted on Xilinx FPGA with open-sourced RISC-V SMP processor Rocket. With single-thread CoreMark, FASE introduces less than 1% performance error and achieves over 2000x higher efficiency compared to Proxy Kernel due to FPGA acceleration. With complex OpenMP benchmarks, FASE demonstrates over 96% performance validation accuracy for most single-thread workloads and over 91.5% for most multi-thread workloads compared to full SoC validation, significantly reducing development complexity and time-to-feedback. All components of FASE framework are released as open-source.

DCDec 1, 2025

Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity

Wenbin Zhu, Zhaoyan Shen, Zili Shao et al.

Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck, as it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates Serverless LLM loading through efficient GPU memory reuse. By leveraging the unused GPU memory to retain model parameters, Tangram significantly reduces model transfer time and cold-start latency. Its design includes three key components: unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource utilization. These techniques collectively address the critical challenges of inefficient memory usage and the cold-start problem in Serverless LLM platforms. We have implemented a fully functional prototype, and experiments show that Tangram achieves up to 6.2 times faster loading and reduces Time-To-First-Token (TTFT) during cold-start by 23--55% over state-of-the-art methods.

Hongjun Dai

2 Papers