SD | AI | AS | Oct 7, 2021

Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study

arXiv:2110.03174v1
Originality: Incremental advance
AI Analysis

This work addresses acoustic event detection for human-centered contexts, but it is incremental as it builds on prior knowledge transfer methods.

The paper tackles acoustic event detection by transferring voice representations from a speaker dataset into the detection pipeline. On AudioSet, this raises mean average precision (mAP) from 0.134 to 0.292 for a CNN baseline and from 0.351 to 0.361 for a TALNet baseline.

Detection of common events and scenes from audio is useful for extracting and understanding human contexts in daily life. Prior studies have shown that leveraging knowledge from a relevant domain benefits a target acoustic event detection (AED) process. Inspired by the observation that many human-centered acoustic events in daily life involve voice elements, this paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an AED pipeline. To this end, we develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process, and we conduct thorough empirical studies to examine its performance on the public AudioSet [1] with different types of inputs. Our main observations are: 1) joint learning of audio and voice inputs improves AED performance, measured as mean average precision (mAP), for both a CNN baseline (0.292 vs 0.134 mAP) and a TALNet [2] baseline (0.361 vs 0.351 mAP); 2) augmenting the extra voice features is critical to maximizing model performance with dual inputs.
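The results above are reported as mean average precision (mAP). As a minimal sketch of how that metric is commonly computed for multi-label classification, the Python below averages the per-class average precision over all classes that have at least one positive clip. The function names, the toy data, and the choice to skip classes without positives are illustrative assumptions; the abstract does not specify the exact evaluation protocol, so this should not be read as the paper's implementation.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one class.

    Clips are ranked by descending score; precision is sampled at the rank
    of each positive clip and the samples are averaged.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    label_matrix: Sequence[Sequence[int]],
    score_matrix: Sequence[Sequence[float]],
) -> float:
    """Mean of per-class AP over a (clips x classes) multi-label problem.

    Assumption: classes with no positive clips are skipped, since AP is
    undefined for them.
    """
    n_classes = len(label_matrix[0])
    aps: List[float] = []
    for c in range(n_classes):
        class_labels = [row[c] for row in label_matrix]
        if sum(class_labels) == 0:
            continue
        class_scores = [row[c] for row in score_matrix]
        aps.append(average_precision(class_labels, class_scores))
    return sum(aps) / len(aps) if aps else 0.0


# Toy example (hypothetical data): 4 clips, 2 event classes.
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.4, 0.6], [0.1, 0.3]]
toy_map = mean_average_precision(toy_labels, toy_scores)
```

In the toy example, the first class has positives at ranks 1 and 3 (AP = (1 + 2/3) / 2) and the second class has positives at ranks 1 and 2 (AP = 1), so the mAP is 11/12.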

