AI · May 12

OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models

arXiv:2605.11678 · 50.5
Predicted impact: top 73% in AI · last 90 days · Originality: Incremental advance
AI Analysis

For practitioners deploying large VLA models on commodity GPUs, this system-level optimization removes the VRAM bottleneck without altering the model, enabling broader accessibility.

This work enables memory-efficient inference of large Vision-Language-Action models (e.g., the 21.52GB Alpamayo-R1-10B) on VRAM-constrained GPUs (e.g., a 16GB RTX 5070Ti) via CPU-GPU memory swapping, achieving up to 3.55x speedup over Accelerate offloading without model modification.

End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network, achieving strong driving performance but requiring 20-60GB of GPU memory, far exceeding the 12-16GB available on commodity GPUs. We present a framework that enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification. Our work proceeds in three stages: (1) Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity; (2) Pipelined Demand Layering hides parameter transfer time within layer execution time via transfer-compute overlap; and (3) a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. We further propose a performance prediction model that determines the optimal configuration (both the number and placement of resident layers) from a single profiling run, with less than 1.3% prediction error across all configurations. Applied to NVIDIA's Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), our work achieves up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision.
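
The three stages can be illustrated with a toy latency model. The sketch below is not the paper's implementation or its prediction model. It assumes one copy stream and one compute stream, a prefetch depth of one layer, and made-up per-layer sizes and timings. The names `Layer`, `sequential_latency`, `pipelined_latency`, and `best_resident_set` are hypothetical, and the exhaustive search is a stand-in for the paper's decision policy, which the abstract does not spell out.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Layer:
    size_gb: float      # parameter footprint (illustrative value)
    transfer_ms: float  # host-to-device copy time for the parameters
    compute_ms: float   # GPU execution time of the layer


def sequential_latency(layers, resident=frozenset()):
    """Stage 1 (toy): load each non-resident layer, then run it, no overlap."""
    return sum(
        layer.compute_ms + (0.0 if i in resident else layer.transfer_ms)
        for i, layer in enumerate(layers)
    )


def pipelined_latency(layers, resident=frozenset()):
    """Stage 2 (toy): prefetch layer i while layer i-1 executes."""
    copy_free = 0.0     # time at which the copy stream is idle again
    compute_free = 0.0  # time at which the compute stream is idle again
    prev_start = 0.0    # start time of the previous layer's compute
    for i, layer in enumerate(layers):
        if i in resident:
            ready = 0.0  # parameters already on the GPU
        else:
            # Assumption: prefetch depth of one, so the copy may begin only
            # once the previous layer has started executing.
            copy_start = max(copy_free, prev_start)
            copy_free = copy_start + layer.transfer_ms
            ready = copy_free
        start = max(compute_free, ready)
        compute_free = start + layer.compute_ms
        prev_start = start
    return compute_free


def best_resident_set(layers, budget_gb):
    """Stage 3 (toy): brute-force the resident set under a VRAM budget."""
    best_latency, best_set = pipelined_latency(layers), frozenset()
    for k in range(1, len(layers) + 1):
        for combo in combinations(range(len(layers)), k):
            if sum(layers[i].size_gb for i in combo) > budget_gb:
                continue
            latency = pipelined_latency(layers, frozenset(combo))
            if latency < best_latency:
                best_latency, best_set = latency, frozenset(combo)
    return best_latency, best_set


# Four identical, transfer-bound layers (made-up numbers).
toy_layers = [Layer(size_gb=1.0, transfer_ms=4.0, compute_ms=3.0) for _ in range(4)]
print(sequential_latency(toy_layers))
print(pipelined_latency(toy_layers))
print(best_resident_set(toy_layers, budget_gb=2.0))
```

With these toy numbers the sequential schedule takes 28 ms and the pipelined one 19 ms. Because each transfer is longer than the compute it hides behind, pipelining alone leaves residual stalls, which is the gap that keeping selected layers resident is meant to close. For scale, the 21.52GB model on a 16GB GPU implies that at least 5.52GB of parameters must live in host memory at any time, before accounting for activations and other buffers.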

