OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models
For practitioners deploying large VLA models on commodity GPUs, this system-level optimization removes the VRAM bottleneck without altering the model, enabling broader accessibility.
This work enables memory-efficient inference of large Vision-Language-Action models (e.g., 21.52GB) on VRAM-constrained GPUs (16GB) via CPU-GPU memory swapping, achieving up to 3.55x speedup over Accelerate offloading without model modification.
End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network. They achieve strong driving performance but require 20-60GB of GPU memory, far exceeding the 12-16GB available on commodity GPUs. We present a framework that enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification. Our work proceeds in three stages: (1) Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity; (2) Pipelined Demand Layering hides parameter transfer time within layer execution time via transfer-compute overlap; and (3) a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. We further propose a performance prediction model that determines the optimal configuration (both the number and placement of resident layers) from a single profiling run, with less than 1.3% prediction error across all configurations. Applied to NVIDIA's Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), our framework achieves up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision.
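To make the three stages concrete, the following is a minimal timing sketch, not the paper's implementation or its prediction model. It assumes fixed per-layer transfer and compute times, a single copy engine, and a one-layer prefetch lookahead (the copy for layer i may begin once layer i-1 starts computing). The function names `sequential_latency`, `pipelined_latency`, and `best_resident` are hypothetical, and the exhaustive search merely stands in for the GPU-Resident Layer Decision Policy.

```python
from itertools import combinations
from typing import Iterable, Sequence, Tuple


def sequential_latency(transfer: Sequence[float], compute: Sequence[float],
                       resident: Iterable[int] = ()) -> float:
    """Toy Sequential Demand Layering: copy a layer, run it, then copy the next."""
    pinned = set(resident)  # layers assumed to stay in VRAM (no copy cost)
    return float(sum((0.0 if i in pinned else t) + c
                     for i, (t, c) in enumerate(zip(transfer, compute))))


def pipelined_latency(transfer: Sequence[float], compute: Sequence[float],
                      resident: Iterable[int] = ()) -> float:
    """Toy Pipelined Demand Layering: overlap the next copy with current compute."""
    pinned = set(resident)
    copy_free = 0.0   # time at which the single (assumed) copy engine is idle
    prev_start = 0.0  # time at which the previous layer started computing
    prev_done = 0.0   # time at which the previous layer finished computing
    for i, (t, c) in enumerate(zip(transfer, compute)):
        if i in pinned:
            ready = 0.0  # weights already on the GPU
        else:
            # Assumed one-layer lookahead: the copy may start once the
            # previous layer begins computing and the copy engine is free.
            ready = max(copy_free, prev_start) + t
            copy_free = ready
        start = max(prev_done, ready)  # wait for both weights and the GPU
        prev_start, prev_done = start, start + c
    return prev_done


def best_resident(transfer: Sequence[float], compute: Sequence[float],
                  k: int) -> Tuple[Tuple[int, ...], float]:
    """Brute-force stand-in for a residency policy: try every set of k layers."""
    candidates = combinations(range(len(transfer)), k)
    return min(((combo, pipelined_latency(transfer, compute, combo))
                for combo in candidates), key=lambda pair: pair[1])
```

Under these assumptions, four identical layers that each take 4 time units to transfer and 3 to compute cost 28 units sequentially and 19 units pipelined. The remaining gap exists because transfer is slower than compute, which is the residual overhead that keeping some layers resident targets: pinning layer 0 alone brings the toy latency down to 15. The exhaustive search is exponential in the number of layers and is only meant to illustrate that both the number and the placement of resident layers matter.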