AS · SD · Mar 11

Learnable Pulse Accumulation for On-Device Speech Recognition: How Much Attention Do You Need?

arXiv:2603.16922 · 58.81 citations · h-index: 16
Predicted impact: top 56% in AS · last 90 days · Originality: Incremental advance
AI Analysis

This work targets the efficiency of on-device speech recognition. It proposes a method for reducing the computational overhead of self-attention, but the contribution is incremental because it modifies existing transformer architectures.

The paper addresses the quadratic scaling of self-attention, which limits transformer-based speech models on edge devices. It introduces the Learnable Pulse Accumulator (LPA), an O(n) replacement for self-attention. With 8 of 12 wav2vec2-base layers replaced, the model reaches a 10.61% word error rate on LibriSpeech test-clean, 7.24 percentage points worse than the 3.37% baseline, in exchange for a 3.27x speedup at 120s audio on Apple M4 Pro hardware.

Self-attention scales quadratically with sequence length, limiting transformer-based speech models on edge devices. We introduce the Learnable Pulse Accumulator (LPA), an O(n) replacement that substitutes key-query dot products with learned gating functions: content-dependent rectangular pulses, periodic windows, and position-dependent basis functions. An MSE diagnostic sweep determines per-layer replacement difficulty and ordering. Replacing 8 of 12 wav2vec2-base layers yields 10.61% word error rate (WER) on LibriSpeech test-clean, +7.24 percentage points (pp) over the 3.37% baseline, with 3.27x speedup at 120s audio on Apple M4 Pro via an optimized MLX inference path. Cross-domain validation on SepFormer speech enhancement shows all 16 intra-chunk attention layers can be replaced without collapse, suggesting the depth wall arises from linguistic computation rather than an LPA limitation. LPA's near-binary gates at inference enable dense GPU computation with no CPU-GPU synchronization, and all operations map to mobile neural accelerators.
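To illustrate why binary rectangular gates allow linear-time mixing, here is a minimal NumPy sketch. It is not the paper's implementation: the abstract does not specify how pulse boundaries are produced, so the per-position `starts` and `ends` arrays are taken as given (hypothetical stand-ins for the output of a learned gating function), and the function names are invented for this example. The sketch averages value frames inside each window using prefix sums, and compares the result with an explicit O(n²) gate matrix.

```python
import numpy as np


def rectangular_pulse_accumulate(values, starts, ends):
    """Average values[s] over s in [starts[t], ends[t]) for each output t.

    Uses prefix sums, so the cost is linear in the number of frames.
    `starts` and `ends` are assumed inputs (hypothetical gate outputs).
    """
    n, d = values.shape
    # prefix[k] holds the sum of values[:k]; prefix[0] is all zeros.
    prefix = np.zeros((n + 1, d), dtype=np.float64)
    prefix[1:] = np.cumsum(values, axis=0)
    starts = np.clip(starts, 0, n - 1)
    ends = np.clip(ends, starts + 1, n)  # every window covers >= 1 frame
    widths = (ends - starts)[:, None]
    return (prefix[ends] - prefix[starts]) / widths


def dense_reference(values, starts, ends):
    """Same result via an explicit binary gate matrix (quadratic cost)."""
    pos = np.arange(values.shape[0])
    gates = (pos[None, :] >= starts[:, None]) & (pos[None, :] < ends[:, None])
    gates = gates.astype(np.float64)
    return gates @ values / gates.sum(axis=1, keepdims=True)


# Two windows over four scalar frames: [0, 2) and [2, 4).
frames = np.array([[1.0], [2.0], [3.0], [4.0]])
out = rectangular_pulse_accumulate(frames, np.array([0, 2]), np.array([2, 4]))
print(out.ravel())  # [1.5 3.5]
```

The prefix-sum form needs only one pass over the sequence plus two gathers per output position. The periodic windows and position-dependent basis functions mentioned in the abstract are not covered by this sketch.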

