LG Mar 9
DyQ-VLA: Temporal-Dynamic-Aware Quantization for Embodied Vision-Language-Action Models
Zihao Zheng, Hangyu Cao, Sicheng Tian et al.
Vision-Language-Action (VLA) models are dominant in embodied intelligence but are constrained by inference overheads. While model quantization alleviates these bottlenecks for edge deployment, static quantization approaches remain suboptimal for VLAs due to two critical challenges: (1) Temporal-dynamic sensitivity, where fixed precision wastes resources by ignoring stage-varying error tolerances; and (2) Real-time allocation, where identifying real-time sensitivity to guide bit allocation remains unsolved. To address these challenges, we propose DyQ-VLA, a dynamic quantization framework for VLAs. Specifically, a sensitivity-aware switching strategy leverages real-time kinematic proxies to trigger the bit-width switch, while a kinematic-guided module dynamically allocates the optimal bit-width. Experiments show that DyQ-VLA requires only 30.9% of the original memory footprint while maintaining 99.5% of its original performance, achieving 1.49x simulation and up to 1.43x real-world speedups.
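The idea of letting a kinematic signal pick the bit-width can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the DyQ-VLA method: the proxy (a hypothetical end-effector speed), the thresholds, the bit-width choices, and the mapping from slow motion to higher precision are all invented, and the quantizer is plain symmetric uniform quantization.

```python
def choose_bits(speed, low=0.05, high=0.20):
    """Map a kinematic proxy to a bit-width.

    `speed` is a hypothetical end-effector speed; `low` and `high` are
    made-up thresholds. Assumption: slow motion is treated as more
    error-sensitive and therefore gets more bits.
    """
    if speed < low:
        return 8
    if speed < high:
        return 4
    return 2


def fake_quantize(values, bits):
    """Symmetric uniform quantize-then-dequantize of a list of floats."""
    qmax = 2 ** (bits - 1) - 1          # largest positive integer level
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)             # nothing to scale
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]


def max_error(values, bits):
    """Largest absolute round-trip error at the given bit-width."""
    return max(abs(v - q) for v, q in zip(values, fake_quantize(values, bits)))


weights = [0.1, -0.5, 0.33, 1.0]
for speed in (0.01, 0.10, 0.50):
    bits = choose_bits(speed)
    print(speed, bits, max_error(weights, bits))
```

The loop shows the trade-off the abstract describes: a lower bit-width costs less memory per value but produces a larger round-trip error, so the switch only pays off when the current stage tolerates that error.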
DC Mar 9
RAPID: Redundancy-Aware and Compatibility-Optimal Edge-Cloud Partitioned Inference for Diverse VLA models
Zihao Zheng, Sicheng Tian, Hangyu Cao et al.
Vision-Language-Action (VLA) models are mainstream in embodied intelligence but face high inference costs. Edge-Cloud Collaborative (ECC) inference offers an effective fix by easing edge-device computing pressure to meet real-time needs. However, existing ECC frameworks are suboptimal for VLA models due to two challenges: (1) mainstream environment-oriented edge-cloud partitioning methods are susceptible to interference from visual noise; (2) existing edge-cloud partitioning methods overlook the step-wise redundancy unique to embodied tasks, thereby disrupting the physical continuity of motion. To address these issues, we propose a novel ECC inference framework, termed RAPID, and develop an implementation tailored to it. Experiments demonstrate that RAPID achieves a speedup of up to 1.73x with only 5% to 7% overhead.
RO Mar 7
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
Zihao Zheng, Zhihao Mao, Xingyue Zhou et al.
Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shift displaces token positions across frames, causing position-wise matching to pair misaligned content; (2) semantic dynamics, where token relevance shifts across task stages as navigation progresses, making cached states stale. We propose VLN-Cache, a visual-dynamic-aware and semantic-dynamic-aware caching framework that introduces view-aligned remapping to recover geometric correspondences and a task-relevance saliency filter to veto reuse at semantic transitions. A layer-adaptive entropy policy further balances the per-layer reuse budget. Experiments on the R2R-CE simulation benchmark show up to 1.52x speedup while maintaining competitive navigation success rates.
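The first failure mode can be made concrete with a minimal sketch of the position-wise baseline that the abstract criticizes. This is not VLN-Cache itself: the function names, the cosine-similarity test, and the 0.95 threshold are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def reuse_mask(prev_tokens, curr_tokens, threshold=0.95):
    """Position-wise matching: compare token i of the previous frame with
    token i of the current frame and mark it reusable if they are similar."""
    return [cosine(p, c) >= threshold for p, c in zip(prev_tokens, curr_tokens)]


prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Static camera: content stays at the same positions, so most tokens match.
static = [[1.0, 0.01], [1.0, 0.0], [1.0, 1.0]]
print(reuse_mask(prev, static))

# Viewpoint shift: the same three tokens, moved by one position.
shifted = [prev[2], prev[0], prev[1]]
print(reuse_mask(prev, shifted))
```

In the shifted case every token from the previous frame is still present, yet position-wise matching pairs misaligned content and rejects all of it. That is the gap a remapping step is meant to close before any similarity test is applied.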
AR Mar 9
GOMA: Geometrically Optimal Mapping via Analytical Modeling for Spatial Accelerators
Wulve Yang, Hailong Zou, Rui Zhou et al.
General matrix multiplication (GEMM) on spatial accelerators is highly sensitive to mapping choices in both execution efficiency and energy consumption. However, the mapping space grows combinatorially, which makes it extremely challenging to obtain optimal mappings within an acceptable time budget. Existing approaches often lack global-optimality guarantees and become prohibitively slow as the mapping space grows. To address these limitations, we propose GOMA, a geometric-abstraction-based, globally optimal GEMM mapping framework via analytical modeling, which achieves efficient solving while guaranteeing optimality. GOMA introduces, from first principles, a geometric abstraction for GEMM mapping, yielding an exact analytical energy objective with $O(1)$ evaluation for any given mapping. The objective is highly accurate. GOMA then formulates mapping selection as an integer optimization problem under hardware and mapping constraints, using the analytical energy model as the objective to automate mapping search. GOMA can quickly compute a globally optimal mapping for any (GEMM workload, target hardware) pair, achieving this for the first time in mapping space exploration. Experiments confirm that across representative accelerators and large language model prefill workloads, GOMA improves the energy--delay product (EDP) by $2.24$--$4.24\times$ over SOTA mappers, while accelerating time-to-solution by $3.83$--$73.6\times$.
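To show what "mapping selection as an integer optimization problem" can look like at the smallest scale, here is a toy sketch. It is not GOMA's energy model or solver: the cost function (off-chip words moved, with the output tile assumed to stay on chip across the reduction), the single buffer-capacity constraint, and the exhaustive enumeration are all simplifying assumptions.

```python
from itertools import product


def divisors(n):
    """All positive integers that divide n evenly, in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def traffic(M, N, K, tm, tn, tk):
    """Toy O(1) cost: off-chip words moved for C[M,N] += A[M,K] @ B[K,N]
    tiled as (tm, tn, tk). Each tile step loads one A tile and one B tile;
    C is assumed to be written back once."""
    steps = (M // tm) * (N // tn) * (K // tk)
    return steps * (tm * tk + tk * tn) + M * N


def best_tiling(M, N, K, buffer_words):
    """Enumerate integer tilings that fit the buffer and return
    (cost, (tm, tn, tk)) for the cheapest, or None if nothing fits."""
    best = None
    for tm, tn, tk in product(divisors(M), divisors(N), divisors(K)):
        if tm * tk + tk * tn + tm * tn > buffer_words:
            continue  # A, B, and C tiles must fit on chip together
        cost = traffic(M, N, K, tm, tn, tk)
        if best is None or cost < best[0]:
            best = (cost, (tm, tn, tk))
    return best


print(best_tiling(4, 4, 4, buffer_words=48))
print(best_tiling(4, 4, 4, buffer_words=12))
```

Enumeration is only feasible here because the problem is tiny; the number of candidate tilings multiplies across dimensions and memory levels, which is the combinatorial growth the abstract refers to and the reason a closed-form objective plus a proper integer solver matters.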