85.0 DC Apr 19
Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
Minxian Xu, Jingfeng Wu, Shengye Song et al.
The rapid rise of Large Language Models (LLMs) has revolutionized various artificial intelligence (AI) applications, from natural language processing to code generation. However, the computational demands of these models, particularly in training and inference, present significant challenges. Traditional systems are often unable to meet these requirements, necessitating the integration of cloud-native and distributed architectures. This paper explores the role of cloud platforms and distributed systems in supporting the scalability, efficiency, and optimization of LLMs. We discuss the complexities of LLM deployment, including data management, resource optimization, and the need for microservices, autoscaling, and hybrid cloud-edge solutions. Additionally, we examine emerging research trends, such as serverless inference, quantum computing, and federated learning, and their potential to drive the next phase of LLM innovation. The paper concludes with a roadmap for future developments, emphasizing the need for continued research, standardization, and cross-sector collaboration to sustain the growth of LLMs in both research and enterprise applications.
76.9 DC Mar 13
Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking
Zizhao Mo, Junlin Chen, Huanle Xu et al.
Nowadays, service providers often deploy multiple types of LLM services within shared clusters. While service colocation improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services, which have strict SLO requirements for inference latency, and severely constrains the service capacity of best-effort (BE) services due to limited available memory. To address interference, existing systems typically rely on reserving headroom to constrain BE resource usage. However, this approach's coarse granularity compromises the SLO compliance of the LS service and unnecessarily restricts the generation potential of the BE service. In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which offloads the Attention computation of BE services to CPUs on the fly. This mechanism also facilitates asynchronous communication between CPU and GPU streams, preventing GPUs from being blocked while aggregating Attention results. Additionally, OmniServe incorporates a dynamic batching control policy to adapt to fluctuating request arrivals, facilitating Dense module computation using layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate for LS services by up to $1.48\times$ while enhancing BE serving throughput by up to $9.85\times$ compared to state-of-the-art systems.
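To make the reported metric concrete, the sketch below shows one common way to compute an SLO attainment rate and a relative improvement factor such as the $1.48\times$ figure. It assumes attainment means the fraction of requests whose latency is at or below the SLO target; the abstract does not give the exact definition, and the function names and latency values are hypothetical, not taken from the paper.

```python
from typing import Sequence


def slo_attainment_rate(latencies_ms: Sequence[float], slo_ms: float) -> float:
    """Fraction of requests whose latency is within the SLO target.

    Assumed definition: a request attains the SLO if latency <= slo_ms.
    """
    if not latencies_ms:
        raise ValueError("need at least one request latency")
    met = sum(1 for latency in latencies_ms if latency <= slo_ms)
    return met / len(latencies_ms)


def relative_improvement(new_rate: float, baseline_rate: float) -> float:
    """Improvement factor of new_rate over baseline_rate (e.g. 1.5 means 1.5x)."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return new_rate / baseline_rate


# Illustrative latencies only (made up for this sketch, not measured results).
baseline = slo_attainment_rate([80, 120, 150, 90], slo_ms=100)
improved = slo_attainment_rate([80, 95, 150, 90], slo_ms=100)
factor = relative_improvement(improved, baseline)
```

With these made-up inputs, two of four requests meet the target in the first run and three of four in the second, so the factor is the ratio of the two rates.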
CV Apr 21, 2017
Robust and Fast Decoding of High-Capacity Color QR Codes for Mobile Applications
Zhibo Yang, Huanle Xu, Jianyuan Deng et al.
The use of color in QR codes brings extra data capacity, but also poses tremendous challenges for the decoding process due to chromatic distortion, cross-channel color interference, and illumination variation. In particular, we discover a new type of chromatic distortion in high-density color QR codes, cross-module color interference, which is caused by the high density that also makes geometric distortion correction more challenging. To address these problems, we propose two approaches, namely LSVM-CMI and QDA-CMI, which jointly model these different types of chromatic distortion. Extended from SVM and QDA, respectively, both LSVM-CMI and QDA-CMI optimize over a particular objective function to learn a color classifier. Furthermore, a robust geometric transformation method and several pipeline refinements are proposed to boost the decoding performance for mobile applications. We put forth and implement a framework for high-capacity color QR codes equipped with our methods, called HiQ. To evaluate the performance of HiQ, we collect a challenging large-scale color QR code dataset, CUHK-CQRC, which consists of 5390 high-density color QR code samples. The comparison with the baseline method [2] on CUHK-CQRC shows that HiQ outperforms [2] by at least 188% in decoding success rate and 60% in bit error rate. Our implementation of HiQ on iOS and Android also demonstrates the effectiveness of our framework in real-world applications.
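The two evaluation metrics named above can be sketched as follows. This is a minimal illustration under assumed, conventional definitions: bit error rate as the fraction of mismatched bits, and decoding success rate as the fraction of samples decoded correctly. The abstract does not state the paper's exact definitions, and the function names and inputs here are hypothetical.

```python
from typing import Sequence


def bit_error_rate(decoded_bits: str, expected_bits: str) -> float:
    """Fraction of bit positions where the decoded stream differs from the truth."""
    if not expected_bits or len(decoded_bits) != len(expected_bits):
        raise ValueError("bit strings must be non-empty and of equal length")
    errors = sum(d != e for d, e in zip(decoded_bits, expected_bits))
    return errors / len(expected_bits)


def decoding_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of code samples that were decoded successfully."""
    if not outcomes:
        raise ValueError("need at least one decoding outcome")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Illustrative inputs only (not drawn from CUHK-CQRC).
ber = bit_error_rate("1011", "1001")                         # one of four bits differs
success = decoding_success_rate([True, False, True, True])   # three of four decoded
```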
CV Sep 19, 2015
Similar Handwritten Chinese Character Discrimination by Weakly Supervised Learning
Zhibo Yang, Huanle Xu, Keda Fu et al.
Traditional approaches to handwritten Chinese character recognition struggle to classify similar characters. In this paper, we propose to discriminate similar handwritten Chinese characters by using weakly supervised learning. Our approach learns a discriminative SVM for each similar pair, which simultaneously localizes the discriminative region of the similar characters and performs the classification. For the first time, similar handwritten Chinese character recognition (SHCCR) is formulated as an optimization problem extended from SVM. We also propose a novel feature descriptor, Gradient Context, and apply a bag-of-words model to represent regions at different scales. In our method, we do not need to select a fixed-size sub-window to differentiate similar characters. This unconstrained property makes our method well adapted to high variance in the size and position of discriminative regions in similar handwritten Chinese characters. We evaluate our proposed approach on the CASIA Chinese character data set, and the results show that our method outperforms the state of the art.