NE · AI · Dec 7, 2024

Trimming Down Large Spiking Vision Transformers via Heterogeneous Quantization Search

arXiv:2412.05505v1 · 3 citations · h-index: 4 · ASAP
Originality: Incremental advance
AI Analysis

This work addresses the high computational cost of deploying spiking transformers on edge devices such as mobile phones. It represents an incremental improvement in model compression techniques for this domain.

The paper tackles the challenge of deploying large spiking transformer models on resource-constrained edge devices. It introduces a heterogeneous quantization method that compresses these models layer by layer, assigning each layer either a uniform or a power-of-two quantization scheme with mixed bit resolutions. The result is an average effective resolution of 3.14-3.67 bits with less than a 1% accuracy drop on DVS-Gesture and CIFAR10-DVS, a model compression rate of 8.71x-10.19x, and energy reductions of 5.69x-10.2x on N-Caltech101, DVS-Gesture, and CIFAR10-DVS.
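The reported compression rates line up with the reported average bit widths if one assumes a 32-bit floating-point baseline and counts weight storage only. That baseline is an assumption made here for illustration, not something stated in the summary, and `compression_ratio` is a hypothetical helper name. A minimal sketch of the arithmetic:

```python
# Assumption: the uncompressed model stores each weight as a 32-bit float,
# and the compression rate counts weight bits only (no scales or metadata).
FP_BITS = 32

def compression_ratio(avg_bits: float, baseline_bits: int = FP_BITS) -> float:
    """Ratio of baseline bits per weight to quantized bits per weight."""
    return baseline_bits / avg_bits

# Average effective resolutions quoted in the text.
for avg_bits in (3.14, 3.67):
    print(f"{avg_bits} bits per weight: {compression_ratio(avg_bits):.2f}x")
```

Under this assumption the two endpoints come out at roughly 10.19x and 8.72x, which is close to the quoted 8.71x-10.19x range.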

Spiking Neural Networks (SNNs) are amenable to deployment on edge devices and neuromorphic hardware due to their lower power dissipation. Recently, SNN-based transformers have garnered significant interest, incorporating attention mechanisms akin to their counterparts in Artificial Neural Networks (ANNs) while demonstrating excellent performance. However, deploying large spiking transformer models on resource-constrained edge devices such as mobile phones still poses significant challenges resulting from the high computational demands of large, uncompressed, high-precision models. In this work, we introduce a novel heterogeneous quantization method for compressing spiking transformers through layer-wise quantization. Our approach optimizes the quantization of each layer using one of two distinct quantization schemes, i.e., uniform or power-of-two quantization, with mixed bit resolutions. Our heterogeneous quantization demonstrates the feasibility of maintaining high performance for spiking transformers while utilizing an average effective resolution of 3.14-3.67 bits with less than a 1% accuracy drop on the DVS-Gesture and CIFAR10-DVS datasets. It attains a model compression rate of 8.71x-10.19x relative to standard floating-point spiking transformers. Moreover, the proposed approach achieves a significant energy reduction of 5.69x, 8.72x, and 10.2x while maintaining high accuracy levels of 85.3%, 97.57%, and 80.4% on the N-Caltech101, DVS-Gesture, and CIFAR10-DVS datasets, respectively.
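To make the two quantization schemes concrete, the sketch below quantizes a toy weight list with a symmetric uniform quantizer and with a power-of-two quantizer, then computes a parameter-weighted average bit width. All function names are hypothetical. The abstract does not describe the search procedure, so the per-layer choice here simply picks the scheme with the lower mean squared error as a stand-in. It should not be read as the authors' method.

```python
import math

def quantize_uniform(weights, bits):
    """Symmetric uniform quantization: evenly spaced levels in [-max|w|, max|w|]."""
    scale = max(abs(w) for w in weights) or 1.0
    levels = 2 ** (bits - 1) - 1              # positive levels; needs bits >= 2
    step = scale / levels
    return [round(w / step) * step for w in weights]

def quantize_power_of_two(weights, bits):
    """Power-of-two quantization: each weight becomes 0 or +/- scale * 2**(-k)."""
    scale = max(abs(w) for w in weights) or 1.0
    max_shift = 2 ** (bits - 1) - 2           # k = 0..max_shift; one code is kept for zero
    out = []
    for w in weights:
        mag = abs(w) / scale
        if mag < 2.0 ** (-(max_shift + 1)):   # below the smallest level: snap to zero
            out.append(0.0)
            continue
        k = min(max(round(-math.log2(mag)), 0), max_shift)
        out.append(math.copysign(scale * 2.0 ** (-k), w))
    return out

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pick_scheme(weights, bits):
    """Stand-in for the search: choose the scheme with lower reconstruction error."""
    candidates = {
        "uniform": quantize_uniform(weights, bits),
        "power_of_two": quantize_power_of_two(weights, bits),
    }
    return min(candidates, key=lambda name: mse(weights, candidates[name]))

def average_effective_bits(layers):
    """layers: list of (num_params, bits); returns the parameter-weighted mean bit width."""
    total = sum(n for n, _ in layers)
    return sum(n * b for n, b in layers) / total

# Toy example: two "layers" with made-up weights and bit widths.
layer_a = [1.0, 0.3, -0.12, 0.0]
layer_b = [0.9, -0.6, 0.35, -0.1]
print(pick_scheme(layer_a, 3), pick_scheme(layer_b, 4))
print(average_effective_bits([(len(layer_a), 3), (len(layer_b), 4)]))
```

Power-of-two levels are attractive on edge hardware because multiplying by 2**(-k) reduces to a bit shift, while uniform levels cover the weight range more evenly. Mixing the two per layer is what makes the quantization heterogeneous.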
