Abdolah Amirany

AR
h-index19
3papers
21citations
Novelty53%
AI Score44

3 Papers

DCJul 19, 2024
Mixture of Experts with Mixture of Precisions for Tuning Quality of Service

HamidReza Imani, Abdolah Amirany, Tarek El-Ghazawi

The increasing demand for deploying large Mixture-of-Experts (MoE) models in resource-constrained environments necessitates efficient approaches to address their high memory and computational requirements challenges. Moreover, given that tasks come in different user-defined constraints and the available resources change over time in multi-tenant environments, it is necessary to design an approach which provides a flexible configuration space. This paper presents an adaptive serving approach for the efficient deployment of MoE models, capitalizing on partial quantization of the experts. By dynamically determining the number of quantized experts and their distribution across CPU and GPU, our approach explores the Pareto frontier and offers a fine-grained range of configurations for tuning throughput and model quality. Our evaluation on an NVIDIA A100 GPU using a Mixtral 8x7B MoE model for three language modelling benchmarks demonstrates that the throughput of token generation can be adjusted from 0.63 to 13.00 token per second. This enhancement comes with a marginal perplexity increase of 3.81 to 4.00, 13.59 to 14.17, and 7.24 to 7.40 for WikiText2, PTB, and C4 datasets respectively under maximum quantization. These results highlight the practical applicability of our approach in dynamic and accuracy-sensitive applications where both memory usage and output quality are important.

ARApr 16
Towards Topology-Aware Very Large-Scale Photonic AI Accelerators

Belal Jahannia, Abdolah Amirany, Hamed Dalir

The rapid growth of deep neural networks (DNNs) has exposed fundamental limitations in electronic accelerators, where data movement dominates energy consumption, commonly referred to as the memory wall. Photonic accelerators offer a compelling alternative due to their inherent parallelism and high-speed matrix operations. However, existing research largely focuses on device-level innovations, leaving system-level scalability insufficiently explored. In this paper, we present a scalable photonic accelerator architecture based on a modular scale-out paradigm using 4 X 4 photonic tensor core units. We perform a systematic architectural analysis that incorporates the practical scaling limits of photonic hardware, including insertion loss, fanout penalties, and laser power limits, which restrict monolithic photonic scaling. Through evaluation on representative DNN workloads (GoogleNet, ResNet-18, MobileNet, and AlphaGo Zero) with up to 1024 processing elements, we identify a topology-dominated scaling bottleneck in the photonic domain, termed the Utilization Wall, where performance is governed by grid topology rather than hardware size. We further establish the Symmetric Grid Rule, demonstrating that symmetric topologies improve utilization by up to 6X while reducing memory access by over 40% compared to linear configurations, which reveal that topology-aware scaling is essential for achieving energy-efficient and high-performance photonic AI accelerators.

LGMay 10, 2025
QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration

HamidReza Imani, Jiaxin Peng, Peiman Mohseni et al.

The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single-GPU. We propose a serving system that employs \textit{similarity-based expert consolidation} to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce \textit{runtime partial reconfiguration}, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves a competitive output quality while maintaining throughput comparable to serving a single model while incurring a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85\% average reduction in turnaround time compared to NVIDIA's multi-instance GPU (MIG). Furthermore, experiments on Google's Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.