DC AI · Apr 16, 2025

Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling

arXiv:2504.15296v1 · 17 citations · h-index: 6 · Proceedings of the 2025 4th International Conference on Big Data, Information and Computer Network

Originality: Incremental advance

AI Analysis

This work addresses scalability challenges for cloud-based AI inference services, offering incremental improvements in performance and resource utilization.

The study tackled the problem of managing dynamic workloads in cloud AI inference services by proposing a hybrid framework for real-time load balancing and autoscaling, resulting in a 35% improvement in load balancing efficiency and a 28% reduction in response delay compared to conventional solutions.
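
To make the headline figures concrete, the snippet applies them to illustrative baselines. The 200 ms delay and the 0.60 efficiency score are hypothetical values chosen only for the arithmetic, not numbers reported in the paper, and this listing does not say how "load balancing efficiency" is measured.

```python
def after_reduction(baseline: float, reduction_pct: float) -> float:
    """Value left after a relative reduction of reduction_pct percent."""
    return baseline * (1 - reduction_pct / 100)


def after_improvement(baseline: float, improvement_pct: float) -> float:
    """Value reached after a relative increase of improvement_pct percent."""
    return baseline * (1 + improvement_pct / 100)


# Hypothetical baselines (NOT from the paper), used only to illustrate the ratios.
baseline_delay_ms = 200.0
baseline_efficiency = 0.60

print(round(after_reduction(baseline_delay_ms, 28), 6))      # 144.0 ms after a 28% cut
print(round(after_improvement(baseline_efficiency, 35), 6))  # 0.81 after a 35% gain
```

Both figures are relative to the conventional baselines the authors compared against, so the absolute gains depend on those unstated baseline values.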

The rapid expansion of AI inference services in the cloud necessitates a robust scalability solution to manage dynamic workloads and maintain high performance. This study proposes a comprehensive scalability optimization framework for cloud AI inference services, focusing on real-time load balancing and autoscaling strategies. The proposed model is a hybrid approach that combines reinforcement learning for adaptive load distribution and deep neural networks for accurate demand forecasting. This multi-layered approach enables the system to anticipate workload fluctuations and proactively adjust resources, ensuring maximum resource utilisation and minimising latency. Furthermore, the incorporation of a decentralised decision-making process within the model serves to enhance fault tolerance and reduce response time in scaling operations. Experimental results demonstrate that the proposed model enhances load balancing efficiency by 35% and reduces response delay by 28%, thereby exhibiting a substantial optimization effect in comparison with conventional scalability solutions.
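
The two halves of the hybrid idea (forecast demand, then scale ahead of it; learn which backend to route to from observed latency) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' method: Holt's linear smoothing is swapped in for the paper's deep-neural-network forecaster, an epsilon-greedy bandit is swapped in for its reinforcement-learning balancer, and the decentralised decision layer is not modelled. All names (`HoltForecaster`, `replicas_needed`, `EpsilonGreedyRouter`) and parameters (`headroom`, `rps_per_replica`, `epsilon`) are hypothetical.

```python
import math
import random
from typing import Dict, Iterable, Optional


class HoltForecaster:
    """Holt's linear (double exponential) smoothing of request rate.
    Hypothetical stand-in for a DNN demand forecaster."""

    def __init__(self, alpha: float = 0.5, beta: float = 0.3) -> None:
        self.alpha, self.beta = alpha, beta
        self.level: Optional[float] = None
        self.trend = 0.0

    def update(self, observed_rps: float) -> None:
        if self.level is None:
            self.level = observed_rps  # first observation seeds the level
            return
        prev = self.level
        # Blend the new observation with the previous level-plus-trend estimate.
        self.level = self.alpha * observed_rps + (1 - self.alpha) * (prev + self.trend)
        self.trend = self.beta * (self.level - prev) + (1 - self.beta) * self.trend

    def forecast(self, steps_ahead: int = 1) -> float:
        if self.level is None:
            return 0.0
        return max(0.0, self.level + steps_ahead * self.trend)


def replicas_needed(forecast_rps: float, rps_per_replica: float,
                    headroom: float = 0.2, min_replicas: int = 1,
                    max_replicas: int = 64) -> int:
    """Proactive rule: provision for the forecast plus a safety margin, clamped."""
    target = math.ceil(forecast_rps * (1 + headroom) / rps_per_replica)
    return max(min_replicas, min(max_replicas, target))


class EpsilonGreedyRouter:
    """Epsilon-greedy bandit over backends, rewarding low observed latency.
    Hypothetical stand-in for an RL load balancer."""

    def __init__(self, backends: Iterable[str], epsilon: float = 0.1,
                 seed: Optional[int] = None) -> None:
        self.backends = list(backends)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.mean_latency: Dict[str, float] = {b: 0.0 for b in self.backends}
        self.count: Dict[str, int] = {b: 0 for b in self.backends}

    def choose(self) -> str:
        untried = [b for b in self.backends if self.count[b] == 0]
        if untried:
            return untried[0]  # try every backend once before exploiting
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.backends)  # explore
        return min(self.backends, key=self.mean_latency.__getitem__)  # exploit

    def record(self, backend: str, latency_ms: float) -> None:
        self.count[backend] += 1
        n = self.count[backend]
        # Incremental running mean of observed latency for this backend.
        self.mean_latency[backend] += (latency_ms - self.mean_latency[backend]) / n


if __name__ == "__main__":
    fc = HoltForecaster()
    for rps in [100, 120, 140, 160]:  # rising load (illustrative numbers)
        fc.update(rps)
    print(replicas_needed(fc.forecast(), rps_per_replica=50))  # 4

    router = EpsilonGreedyRouter(["gpu-a", "gpu-b"], epsilon=0.0, seed=0)
    router.record(router.choose(), 80.0)  # gpu-a tried first
    router.record(router.choose(), 30.0)  # then gpu-b
    print(router.choose())  # gpu-b, the lower mean latency
```

Scaling on the forecast rather than on the current load is what makes the rule proactive: replicas are requested while demand is still rising, which is the behaviour the abstract attributes to its forecasting layer.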
