FRILL: A Non-Semantic Speech Embedding for Mobile Devices
This work addresses the run-time performance bottleneck of speech embeddings in mobile settings, enabling applications such as mobile health tasks. The contribution is incremental, since it builds on the existing TRILL embedding.
The authors tackled the problem of deploying learned speech representations on mobile devices by proposing lightweight non-semantic speech embedding models. One resulting model, FRILL, is 32x faster than TRILL on a Pixel 1 smartphone and 40% of its size, with only a 2% average decrease in accuracy.
Learned speech representations can drastically improve performance on tasks with limited labeled data. However, due to their size and complexity, learned representations have limited utility in mobile settings where run-time performance can be a significant bottleneck. In this work, we propose a class of lightweight non-semantic speech embedding models that run efficiently on mobile devices based on the recently proposed TRILL speech embedding. We combine novel architectural modifications with existing speed-up techniques to create embedding models that are fast enough to run in real-time on a mobile device and exhibit minimal performance degradation on a benchmark of non-semantic speech tasks. One such model (FRILL) is 32x faster on a Pixel 1 smartphone and 40% the size of TRILL, with an average decrease in accuracy of only 2%. To our knowledge, FRILL is the highest-quality non-semantic embedding designed for use on mobile devices. Furthermore, we demonstrate that these representations are useful for mobile health tasks such as non-speech human sounds detection and face-masked speech detection. Our models and code are publicly available.