AS LG SD · May 20, 2025

Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios

arXiv:2505.14517v1 · 4 citations · h-index: 2 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This paper addresses speaker extraction in spatially dynamic environments, offering a more practical alternative to manually tracking a moving speaker. The advance is incremental because it builds on existing deep non-linear spatial filtering.

The paper tackles the extraction of moving speakers in dynamic scenarios, where existing methods require precise, time-dependent directional cues. It proposes a weakly guided method that needs only the target's initial position and that outperforms a mismatched strongly guided method.

Recent speaker extraction methods using deep non-linear spatial filtering perform exceptionally well when the target direction is known and stationary. However, spatially dynamic scenarios are considerably more challenging due to time-varying spatial features and arising ambiguities, e.g., when moving speakers cross. While in a static scenario it may be easy for a user to point to the target's direction, manually tracking a moving speaker is impractical. Instead of relying on accurate time-dependent directional cues, which we refer to as strong guidance, in this paper we propose a weakly guided extraction method that depends solely on the target's initial position to cope with spatially dynamic scenarios. By incorporating our own deep tracking algorithm and developing a joint training strategy on a synthetic dataset, we demonstrate the proficiency of our approach in resolving spatial ambiguities and even outperform a mismatched, but strongly guided extraction method.
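
To make the idea of weak guidance concrete, the following is a toy sketch only, not the paper's deep tracking algorithm. It assumes that each frame supplies a set of unlabeled candidate directions of arrival (in degrees), and it follows the target from its initial direction with a simple constant-velocity prediction. The function name `track_from_initial_direction`, the candidate representation, and the linear trajectories are all hypothetical choices made for illustration; angle wrap-around is ignored.

```python
def track_from_initial_direction(initial_doa, candidates_per_frame):
    """Follow one speaker given only its initial direction (degrees).

    Hypothetical toy tracker: predicts the next direction with a
    constant-velocity model and picks the closest unlabeled candidate.
    """
    estimate = initial_doa
    velocity = 0.0
    track = []
    for candidates in candidates_per_frame:
        predicted = estimate + velocity
        chosen = min(candidates, key=lambda c: abs(c - predicted))
        velocity = chosen - estimate  # update the per-frame motion estimate
        estimate = chosen
        track.append(chosen)
    return track


# Two speakers on linear trajectories that cross at frame 5 (both at 70 degrees).
frames = 11
target = [20.0 + 10.0 * t for t in range(frames)]
interferer = [120.0 - 10.0 * t for t in range(frames)]

# Candidates are sorted sets, so speaker identity is not available per frame.
candidates_per_frame = [sorted({a, b}) for a, b in zip(target, interferer)]

# Weak guidance: only the target's initial direction is provided.
track = track_from_initial_direction(20.0, candidates_per_frame)
print(track)
```

In this toy setting the velocity term is what keeps the track on the target after the crossing: directly after frame 5 both candidates are equally far from the last estimate, so a purely nearest-neighbour choice would be a tie.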

