AS · SD · Aug 13, 2020

Cross attentive pooling for speaker verification

arXiv:2008.05983v2 · 6 citations
AI Analysis

This paper addresses speaker verification on noisy video data, offering an incremental improvement over existing pooling methods.

The paper tackles text-independent speaker verification in noisy 'in the wild' videos by proposing Cross Attentive Pooling (CAP), which uses context across reference-query pairs to generate discriminative embeddings, outperforming comparable pooling strategies on the VoxCeleb dataset.

The goal of this paper is text-independent speaker verification where utterances come from 'in the wild' videos and may contain irrelevant signal. While speaker verification is naturally a pair-wise problem, existing methods to produce the speaker embeddings are instance-wise. In this paper, we propose Cross Attentive Pooling (CAP) that utilizes the context information across the reference-query pair to generate utterance-level embeddings that contain the most discriminative information for the pair-wise matching problem. Experiments are performed on the VoxCeleb dataset in which our method outperforms comparable pooling strategies.
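To make the instance-wise versus pair-wise distinction concrete, here is a minimal NumPy sketch. It is not the paper's CAP formulation, which the abstract does not spell out. It assumes frame-level features of shape (T, D), uses the mean of the other utterance's frames as a hypothetical attention context, and applies scaled dot-product weights. All function names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def instance_pool(frames):
    """Instance-wise baseline: average frames (T, D) into one embedding (D,)."""
    return frames.mean(axis=0)


def cross_attentive_pool(frames, other_frames):
    """Pool `frames` (T, D) using weights conditioned on the paired utterance.

    Assumption: the context vector is the mean of the other utterance's
    frames; the paper may construct its context differently.
    """
    context = other_frames.mean(axis=0)                    # (D,)
    scores = frames @ context / np.sqrt(frames.shape[1])   # (T,)
    weights = softmax(scores)                              # (T,), sums to 1
    return weights @ frames, weights                       # (D,), (T,)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pairwise_score(ref_frames, qry_frames):
    """Score a reference-query pair; each side is pooled given the other."""
    ref_emb, _ = cross_attentive_pool(ref_frames, qry_frames)
    qry_emb, _ = cross_attentive_pool(qry_frames, ref_frames)
    return cosine(ref_emb, qry_emb)
```

With `instance_pool`, an utterance's embedding is the same whichever utterance it is compared against. With `cross_attentive_pool`, the embedding changes with the pairing, so frames that are irrelevant to a given comparison can receive lower weight.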
