LGAIMay 1, 2025

Scaling On-Device GPU Inference for Large Generative Models

arXiv:2505.00232v111 citationsh-index: 6Has Code2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality Incremental advance
AI Analysis

This addresses the need for privacy and efficiency in deploying complex generative AI on resource-constrained devices like mobile and desktop platforms.

The paper tackles the problem of enabling on-device GPU inference for large generative models, presenting ML Drift, which allows execution of models with 10 to 100x more parameters than existing on-device models and achieves an order-of-magnitude performance improvement over open-source GPU inference engines.

Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, we present ML Drift--an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines. ML Drift enables on-device execution of generative AI workloads which contain 10 to 100x more parameters than existing on-device generative AI models. ML Drift addresses intricate engineering challenges associated with cross-GPU API development, and ensures broad compatibility across mobile and desktop/laptop platforms, thereby facilitating the deployment of significantly more complex models on resource-constrained devices. Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes