Small-footprint slimmable networks for keyword spotting
This work addresses the need for efficient on-device keyword spotting models across varying memory and compute budgets. Its contribution is incremental, since it adapts existing slimmable network techniques to this domain.
The authors tackled the problem of small-footprint keyword spotting by applying slimmable neural networks to create super-nets from CNNs and Transformers, enabling extraction of sub-networks with under 250k parameters that match or outperform models trained from scratch on Alexa and Google Speech Commands data.
In this work, we apply slimmable neural networks to the problem of small-footprint keyword spotting. We show that slimmable neural networks allow us to create super-nets from Convolutional Neural Networks and Transformers, from which sub-networks of different sizes can be extracted. We demonstrate the usefulness of these models on in-house Alexa data and Google Speech Commands, and focus on models for the on-device use case, limiting ourselves to fewer than 250k parameters. We show that slimmable models can match (and in some cases outperform) models trained from scratch. Slimmable neural networks are therefore particularly useful when the same functionality must be replicated at different memory and compute budgets, with different accuracy requirements.
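To make the idea of extracting sub-networks from a super-net concrete, the following is a minimal sketch of a single dense layer whose narrower variants reuse the leading rows of one shared weight matrix. It is an illustration under stated assumptions, not the authors' implementation: the class name `SlimmableLinear`, the 40-dimensional input, and the widths 64 and 32 are all hypothetical, and joint training of the different widths is omitted entirely.

```python
import numpy as np


class SlimmableLinear:
    """Dense layer that can run at several output widths from shared weights.

    Hypothetical illustration: the name and sizes are assumptions, and no
    training procedure is shown.
    """

    def __init__(self, max_in, max_out, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix and bias sized for the widest configuration.
        self.weight = rng.standard_normal((max_out, max_in)) * 0.1
        self.bias = np.zeros(max_out)

    def forward(self, x, out_width):
        # The input width is taken from x, so a narrower preceding layer
        # automatically selects the matching leading columns.
        in_width = x.shape[-1]
        w = self.weight[:out_width, :in_width]
        b = self.bias[:out_width]
        return x @ w.T + b

    @staticmethod
    def num_params(in_width, out_width):
        # Weights plus biases used by the sub-layer at this width.
        return in_width * out_width + out_width


layer = SlimmableLinear(max_in=40, max_out=64)
x = np.ones((2, 40))

full = layer.forward(x, out_width=64)  # widest sub-network
half = layer.forward(x, out_width=32)  # narrower sub-network, same weights

full_params = SlimmableLinear.num_params(40, 64)
half_params = SlimmableLinear.num_params(40, 32)
```

Because the narrower configuration slices the same parameters, a single stored super-net can serve several parameter budgets. In this sketch the half-width layer uses half the parameters of the full-width layer and reproduces its first 32 output units exactly, since the input width is unchanged.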