Shouxu Lin

DC
h-index: 34
3 papers
7 citations
Novelty: 48%
AI Score: 40

3 Papers

86.6 DC Apr 28
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference

Shouxu Lin, Zhiyuan Guo, Jiaxin Lin

LLM inference is constrained by GPU memory capacity and bandwidth. Tiered memory architectures mitigate this by allowing the GPU to offload memory to the remote tier. However, existing memory offloading frameworks rely on prefetching data into local GPU HBM. This approach underutilizes system resources by introducing HBM contention, squandering memory capacity, and creating pipeline bubbles. We show that enabling direct GPU access to remote memory significantly outperforms prefetching, achieving optimal aggregate system bandwidth. We propose DAK, an end-to-end direct-access memory offloading framework that repurposes the Tensor Memory Accelerator (TMA) to asynchronously fetch offloaded weights and KV caches directly from remote memory into GPU shared memory (SMEM). To maximize remote access performance, DAK introduces a greedy algorithm to determine optimal per-operation offloading ratios, alongside active congestion control and TMA multicast to eliminate interconnect bottlenecks and read amplification. Evaluations across diverse architectures show that DAK achieves near-optimal bandwidth aggregation, with up to 3$\times$ performance gains on NVLink-C2C and 1.8$\times$ on PCIe systems compared to state-of-the-art memory offloading baselines.
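
The bandwidth-aggregation claim can be illustrated with a small back-of-the-envelope model. If a fraction r of an operation's bytes streams directly from remote memory while the remainder streams from HBM at the same time, the operation finishes when the slower of the two streams finishes, and the two finish together when r equals the remote share of the total bandwidth. The sketch below is only this idealized model, not DAK's greedy algorithm; the function names and the bandwidth figures are hypothetical and chosen for illustration.

```python
def transfer_time(size_gb, ratio, hbm_gbps, remote_gbps):
    """Seconds to read size_gb when `ratio` of the bytes stream from remote
    memory concurrently with the rest streaming from local HBM."""
    remote_time = ratio * size_gb / remote_gbps
    hbm_time = (1.0 - ratio) * size_gb / hbm_gbps
    # Both streams run in parallel, so the slower one determines completion.
    return max(remote_time, hbm_time)


def balanced_ratio(hbm_gbps, remote_gbps):
    """Offloading ratio at which both streams finish at the same time."""
    return remote_gbps / (hbm_gbps + remote_gbps)


# Hypothetical bandwidths in GB/s (assumptions, not measurements from the paper).
hbm, remote = 3000.0, 450.0
size = 10.0  # GB read by one operation

r = balanced_ratio(hbm, remote)
effective = size / transfer_time(size, r, hbm, remote)
print(f"balanced ratio = {r:.3f}, effective bandwidth = {effective:.0f} GB/s")
```

In this model, any ratio other than the balanced one leaves one of the two memory paths idle for part of the operation, which lowers the effective bandwidth below the sum of the two.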

DC Oct 12, 2025
FLAMMABLE: A Multi-Model Federated Learning Framework with Multi-Model Engagement and Adaptive Batch Sizes

Shouxu Lin, Zimeng Pan, Yuhang Yao et al.

Multi-Model Federated Learning (MMFL) is an emerging direction in Federated Learning (FL) where multiple models are trained in parallel, generally on various datasets. Optimizing the models' accuracies and training times in the MMFL setting requires adapting to data and system heterogeneity across clients as in single-model FL; these challenges are amplified in the MMFL setting due to additional heterogeneity across models. Neither existing solutions nor naïve extensions of single-model FL frameworks efficiently address these challenges. To bridge this gap, we propose FLAMMABLE, a comprehensive MMFL training framework. FLAMMABLE optimizes model training by intelligently adapting client batch sizes while engaging them to train multiple carefully chosen models, depending on their system capabilities, in each training round. To evaluate FLAMMABLE, we develop the first benchmark platform for the MMFL setting, which may enable future reproducible MMFL research. Extensive evaluations on multiple datasets and models show that FLAMMABLE boosts the MMFL time-to-accuracy performance by 1.1$\sim$10.0$\times$ while improving the final model accuracy by 1.3$\sim$5.4\% compared to several known baselines.

HC Feb 19, 2020
Emotion Recognition Through Observer's Physiological Signals

Yang Liu, Tom Gedeon, Sabrina Caldwell et al.

Emotion recognition based on physiological signals is an active research topic with a wide range of applications, such as safe driving, health care, and creating a secure society. This paper introduces a physiological dataset, PAFEW, which is obtained using movie clips from the Acted Facial Expressions in the Wild (AFEW) dataset as stimuli. To establish a baseline, we use the electrodermal activity (EDA) signals in this dataset and extract 6 features from each signal series corresponding to each movie clip to recognize 7 emotions, i.e., Anger, Disgust, Fear, Happy, Surprise, Sad and Neutral. Overall, 24 observers participated in our collection of the training set, including 19 observers who participated in only one session watching 80 videos from 7 classes and 5 observers who participated multiple times and watched all the videos. All videos were presented in an order-balanced fashion. Leave-one-observer-out validation was employed in this classification task. We report the classification accuracy of our baseline, a three-layer network, on this initial training set when training with signals from all participants, from only the single-session participants, and from only the multi-session participants. We also investigate the recognition accuracy when grouping the dataset by arousal or valence, which achieves 68.66% and 72.72%, respectively. Finally, we provide a two-step network. The first step uses a network to classify the features into high/low arousal or positive/negative valence. The intermediate arousal/valence output of the first step is then concatenated with the feature sets as input to the second step for emotion recognition. We found that adding arousal or valence information helps to improve the classification accuracy. In addition, positive/negative valence information improves the classification accuracy more than arousal information does on this dataset.
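
The leave-one-observer-out protocol mentioned in the abstract can be sketched as follows: every sample is tagged with the observer it came from, and each fold holds out all samples of exactly one observer for testing. This is a minimal illustration, not the authors' code; the function name and the toy observer labels are hypothetical.

```python
from collections import defaultdict


def leave_one_observer_out(observer_ids):
    """Yield (held_out_observer, train_indices, test_indices), one fold per observer."""
    by_observer = defaultdict(list)
    for idx, obs in enumerate(observer_ids):
        by_observer[obs].append(idx)
    for obs in sorted(by_observer):
        test_idx = by_observer[obs]
        # Training data is every sample that does not belong to the held-out observer.
        train_idx = sorted(
            i for other, idxs in by_observer.items() if other != obs for i in idxs
        )
        yield obs, train_idx, test_idx


# Toy example: six samples recorded from three hypothetical observers.
observer_ids = ["A", "A", "B", "C", "C", "C"]
folds = list(leave_one_observer_out(observer_ids))
for obs, train_idx, test_idx in folds:
    print(obs, train_idx, test_idx)
```

Splitting by observer rather than by sample keeps signals from the same person out of both the training and test sets, so the reported accuracy reflects generalization to unseen observers.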