LG | Dec 17, 2025

EdgeFlex-Transformer: Transformer Inference for Edge Devices

arXiv:2512.19741v1 | 1 citation

Originality: Incremental advance

AI Analysis

This work addresses the problem of efficient transformer inference for edge computing, offering a practical solution for resource-constrained environments, though it is incremental as it builds on existing compression techniques.

The paper tackles the challenge of deploying large transformer models on edge devices by proposing a multi-stage optimization pipeline that compresses a 632-million-parameter Vision Transformer (ViT-Huge). On CIFAR-10, the optimized model achieves a 76% reduction in peak memory usage and over 6x lower latency while retaining or improving on the accuracy of the FP32 baseline.
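To give a rough sense of scale, the sketch below computes how much storage the weights of a 632-million-parameter model would need at FP32, FP16, and INT8. It is illustrative arithmetic only, not the paper's measurement: the reported 76% figure is for peak memory, which also includes activations and runtime buffers and reflects pruning as well as reduced precision. The helper names `weight_bytes` and `reduction` are hypothetical.

```python
# Illustrative arithmetic only: weight storage for a 632M-parameter model
# at different precisions. Peak memory in practice also includes activations
# and runtime buffers, which this sketch ignores.

PARAMS = 632_000_000  # parameter count stated in the summary

# Bytes used to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_bytes(num_params: int, dtype: str) -> int:
    """Return the bytes needed to store num_params weights in the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype]


def reduction(baseline_bytes: int, new_bytes: int) -> float:
    """Fractional reduction relative to the baseline (0.75 means 75% smaller)."""
    return 1.0 - new_bytes / baseline_bytes


fp32 = weight_bytes(PARAMS, "fp32")
for dtype in BYTES_PER_PARAM:
    size = weight_bytes(PARAMS, dtype)
    print(f"{dtype}: {size / 1e9:.3f} GB, reduction vs fp32: {reduction(fp32, size):.0%}")
```

Under these assumptions, the weights alone take about 2.5 GB at FP32, and storing them all as INT8 would cut that by 75%.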

Deploying large-scale transformer models on edge devices presents significant challenges due to strict constraints on memory, compute, and latency. In this work, we propose a lightweight yet effective multi-stage optimization pipeline designed to compress and accelerate Vision Transformers (ViTs) for deployment in resource-constrained environments. Our methodology combines activation profiling, memory-aware pruning, selective mixed-precision execution, and activation-aware quantization (AWQ) to reduce the model's memory footprint without requiring costly retraining or task-specific fine-tuning. Starting from a ViT-Huge backbone with 632 million parameters, we first identify low-importance channels using activation statistics collected via forward hooks, followed by structured pruning to shrink the MLP layers under a target memory budget. We further apply FP16 conversion to selected components and leverage AWQ to quantize the remaining model weights and activations to INT8 with minimal accuracy degradation. Our experiments on CIFAR-10 demonstrate that the fully optimized model achieves a 76% reduction in peak memory usage and over 6x lower latency, while retaining or even improving accuracy compared to the original FP32 baseline. This framework offers a practical path toward efficient transformer inference on edge platforms, and opens future avenues for integrating dynamic sparsity and Mixture-of-Experts (MoE) architectures to further scale performance across diverse tasks.
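The profiling and pruning steps described in the abstract can be sketched on a toy MLP block. The following is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: it assumes channel importance is the mean absolute activation per hidden channel, and that the memory budget maps to a channel count by dividing by the per-channel parameter cost. All function names (`channel_importance`, `channels_within_budget`, `select_channels`, `prune_mlp`) and the toy data are hypothetical. In a real ViT the activations would come from forward hooks on the MLP hidden layer.

```python
# Sketch: structured pruning of one MLP block (fc1 -> activation -> fc2)
# using per-channel activation statistics. Pure Python, toy sizes.
# Assumption: importance = mean absolute activation (the paper's exact
# statistic is not specified in the abstract).


def channel_importance(activations):
    """Mean absolute activation per hidden channel.

    activations: list of samples, each a list of hidden-channel values
    (the kind of data a forward hook on the hidden layer would record).
    """
    n = len(activations)
    hidden = len(activations[0])
    return [sum(abs(sample[c]) for sample in activations) / n for c in range(hidden)]


def channels_within_budget(hidden, in_dim, out_dim, budget_bytes, bytes_per_param=4):
    """How many hidden channels fit in the budget (assumed cost model).

    Each hidden channel owns one fc1 row, one fc1 bias, and one fc2 column.
    """
    per_channel = (in_dim + 1 + out_dim) * bytes_per_param
    return min(hidden, budget_bytes // per_channel)


def select_channels(importance, keep):
    """Indices of the `keep` most important channels, in original order."""
    ranked = sorted(range(len(importance)), key=lambda c: importance[c], reverse=True)
    return sorted(ranked[:keep])


def prune_mlp(w1, b1, w2, kept):
    """Drop hidden channels not listed in `kept`.

    w1: hidden x in (rows are hidden channels), b1: hidden,
    w2: out x hidden (columns are hidden channels).
    """
    w1_p = [w1[c] for c in kept]
    b1_p = [b1[c] for c in kept]
    w2_p = [[row[c] for c in kept] for row in w2]
    return w1_p, b1_p, w2_p


# Toy block: 2 inputs, 4 hidden channels, 2 outputs.
w1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
b1 = [0.0, 0.1, 0.2, 0.3]
w2 = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

# Recorded hidden activations for two calibration samples (made-up values).
acts = [[0.9, 0.01, -0.5, 0.0], [1.1, -0.02, 0.7, 0.0]]

importance = channel_importance(acts)
keep = channels_within_budget(hidden=4, in_dim=2, out_dim=2, budget_bytes=40)
kept = select_channels(importance, keep)
w1_p, b1_p, w2_p = prune_mlp(w1, b1, w2, kept)
print("kept channels:", kept)
```

Because whole channels are removed from both layers at once, the pruned block stays a dense MLP with a smaller hidden width, so it needs no sparse kernels to run faster.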
