SDAICVLGASFeb 17, 2022

A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning

arXiv:2202.08509v1
Originality Synthesis-oriented
AI Analysis

This work addresses wake word spotting for TV wake-up scenarios, offering a potential product solution, but it is incremental as it applies known pruning techniques to a specific domain.

The paper tackles the challenge of wake word spotting under noisy conditions by designing a compact audio-visual system using lip visual information, achieving significant performance improvements over single-modality systems and reducing network parameters and computations without performance degradation.

Audio-only-based wake word spotting (WWS) is challenging under noisy conditions due to environmental interference in signal transmission. In this paper, we investigate on designing a compact audio-visual WWS system by utilizing visual information to alleviate the degradation. Specifically, in order to use visual information, we first encode the detected lips to fixed-size vectors with MobileNet and concatenate them with acoustic features followed by the fusion network for WWS. However, the audio-visual model based on neural networks requires a large footprint and a high computational complexity. To meet the application requirements, we introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF), to the single-modal and multi-modal models, respectively. Tested on our in-house corpus for audio-visual WWS in a home TV scene, the proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions. Moreover, LTH-IF pruning can largely reduce the network parameters and computations with no degradation of WWS performance, leading to a potential product solution for the TV wake-up scenario.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes