17.5LGApr 22
Scalable AI Inference: Performance Analysis and Optimization of AI Model ServingHung Cuong Pham, Fatih Gedikli
AI research often emphasizes model design and algorithmic performance, while deployment and inference remain comparatively underexplored despite being critical for real-world use. This study addresses that gap by investigating the performance and optimization of a BentoML-based AI inference system for scalable model serving developed in collaboration with graphworks.ai. The evaluation first establishes baseline performance under three realistic workload scenarios. To ensure a fair and reproducible assessment, a pre-trained RoBERTa sentiment analysis model is used throughout the experiments. The system is subjected to traffic patterns following gamma and exponential distributions in order to emulate real-world usage conditions, including steady, bursty, and high-intensity workloads. Key performance metrics, such as latency percentiles and throughput, are collected and analyzed to identify bottlenecks in the inference pipeline. Based on the baseline results, optimization strategies are introduced at multiple levels of the serving stack to improve efficiency and scalability. The optimized system is then reevaluated under the same workload conditions, and the results are compared with the baseline using statistical analysis to quantify the impact of the applied improvements. The findings demonstrate practical strategies for achieving efficient and scalable AI inference with BentoML. The study examines how latency and throughput scale under varying workloads, how optimizations at the runtime, service, and deployment levels affect response time, and how deployment in a single-node K3s cluster influences resilience during disruptions.
AIFeb 24, 2025
Evaluating the Effectiveness of Large Language Models in Automated News Article SummarizationLionel Richy Panlap Houamegni, Fatih Gedikli
The automation of news analysis and summarization presents a promising solution to the challenge of processing and analyzing vast amounts of information prevalent in today's information society. Large Language Models (LLMs) have demonstrated the capability to transform vast amounts of textual data into concise and easily comprehensible summaries, offering an effective solution to the problem of information overload and providing users with a quick overview of relevant information. A particularly significant application of this technology lies in supply chain risk analysis. Companies must monitor the news about their suppliers and respond to incidents for several critical reasons, including compliance with laws and regulations, risk management, and maintaining supply chain resilience. This paper develops an automated news summarization system for supply chain risk analysis using LLMs. The proposed solution aggregates news from various sources, summarizes them using LLMs, and presents the condensed information to users in a clear and concise format. This approach enables companies to optimize their information processing and make informed decisions. Our study addresses two main research questions: (1) Are LLMs effective in automating news summarization, particularly in the context of supply chain risk analysis? (2) How effective are various LLMs in terms of readability, duplicate detection, and risk identification in their summarization quality? In this paper, we conducted an offline study using a range of publicly available LLMs at the time and complemented it with a user study focused on the top performing systems of the offline experiments to evaluate their effectiveness further. Our results demonstrate that LLMs, particularly Few-Shot GPT-4o mini, offer significant improvements in summary quality and risk identification.